How do you extract an HTML table and add a new column with constant values from an earlier <strong> tag?
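I am trying to extract a series of tables from an HTML document and append a new column with a constant value taken from the tag used as a header. The idea is then to turn this new three-column table into a dataframe. Below is the code I have come up with so far. I.e., each table would have a third column in which every row value equals AGO, DPK, ATK or PMS, depending on which header precedes the series of tables. I'm new to Python and HTML and would appreciate any help. Thanks a mill!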
import pandas as pd
from bs4 import BeautifulSoup
from robobrowser import RoboBrowser

br = RoboBrowser()
br.open("https://oilpriceng.net/03-09-2019")

table = br.find_all('td', class_='vc_table_cell')
for element in table:
    data = element.find('span', class_='vc_table_content')
    prod_name = br.find_all('strong')
    ago = prod_name[0].text
    dpk = prod_name[1].text
    atk = prod_name[2].text
    pms = prod_name[3].text
    if br.find('strong').text == ago:
        data.append(ago.text)
    elif br.find('strong').text == dpk:
        data.append(dpk.text)
    elif br.find('strong').text == atk:
        data.append(atk.text)
    elif br.find('strong').text == pms:
        data.append(pms.text)
    print(data.text)
df = pd.DataFrame(data)
The result I'm hoping for is to go from this:

AGO
Enterprise  Price
Coy A       $0.5/L
Coy B       $0.6/L
Coy C       $0.7/L

to the new table below, as a DataFrame in pandas:

Enterprise  Price   Product
Coy A       $0.5/L  AGO
Coy B       $0.6/L  AGO
Coy C       $0.7/L  AGO

and to repeat the same thing for the other tables with DPK, ATK and PMS information.
I hope I understood your question. This script scrapes all the tables found on the page into a single dataframe and saves it to a CSV file:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilpriceng.net/03-09-2019/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data, last = {'Enterprise': [], 'Price': [], 'Product': []}, ''
# Walk the product headers and the table rows together, in document order.
for tag in soup.select('h1 strong, tr:has(td.vc_table_cell)'):
    if tag.name == 'strong':
        # Remember the most recent product header (AGO, DPK, ...).
        last = tag.get_text(strip=True)
    else:
        # Each data row is expected to hold exactly two cells: name and price.
        a, b = tag.select('td')
        a, b = a.get_text(strip=True), b.get_text(strip=True)
        # Skip rows without a name and the repeated 'DEPOT PRICE' header row.
        if a and b != 'DEPOT PRICE':
            data['Enterprise'].append(a)
            data['Price'].append(b)
            data['Product'].append(last)

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv')
Prints:
            Enterprise         Price  Product
0            AVIDOR PH        ₦190.0      AGO
1            SHORELINK                    AGO
2    BULK STRATEGIC PH        ₦190.0      AGO
3                  TSL                    AGO
4              MASTERS                    AGO
..                 ...           ...      ...
165             CHIPET        ₦132.0      PMS
166               BOND                    PMS
167           RAIN OIL                    PMS
168               MENJ        ₦133.0      PMS
169              NIPCO  ₦ 2,9000,000      LPG

[170 rows x 3 columns]
data.csv (screenshot from LibreOffice):
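For trying the idea without hitting the site, here is a minimal offline sketch of a slightly different approach: instead of tracking the last header while looping, each row looks backwards for the nearest preceding <strong> tag with Beautiful Soup's find_previous. The SAMPLE_HTML markup and the tables_with_product helper are assumptions made up for illustration; the real page is more complex and may need a tighter selector for the header tags.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified markup standing in for the real page.
SAMPLE_HTML = """
<h1><strong>AGO</strong></h1>
<table>
  <tr><td class="vc_table_cell">DEPOT</td><td class="vc_table_cell">DEPOT PRICE</td></tr>
  <tr><td class="vc_table_cell">Coy A</td><td class="vc_table_cell">190.0</td></tr>
  <tr><td class="vc_table_cell">Coy B</td><td class="vc_table_cell"></td></tr>
</table>
<h1><strong>DPK</strong></h1>
<table>
  <tr><td class="vc_table_cell">DEPOT</td><td class="vc_table_cell">DEPOT PRICE</td></tr>
  <tr><td class="vc_table_cell">Coy C</td><td class="vc_table_cell">210.5</td></tr>
</table>
"""


def tables_with_product(html):
    """Return one DataFrame with a Product column taken from the preceding <strong>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True)
                 for td in tr.find_all("td", class_="vc_table_cell")]
        # Keep only two-cell data rows; drop the repeated column-header row.
        if len(cells) != 2 or not cells[0] or cells[1] == "DEPOT PRICE":
            continue
        # Nearest <strong> that appears before this row in the document.
        header = tr.find_previous("strong")
        product = header.get_text(strip=True) if header else ""
        rows.append({"Enterprise": cells[0], "Price": cells[1], "Product": product})
    return pd.DataFrame(rows, columns=["Enterprise", "Price", "Product"])


df = tables_with_product(SAMPLE_HTML)
print(df)
```

Note that find_previous returns whichever <strong> comes last before the row, so this only works if the table cells themselves contain no <strong> tags; on a page where they do, restricting the lookup to header tags (as the selector in the script above does) is the safer choice.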