How to convert an HTML table with cells that span several columns into a list of lists in Python 3?
I'm new to Python and have started a small project that requires some web scraping. I started out with BS4, but I'm struggling a bit while trying to convert an HTML table that contains cells spanning several columns into a list of lists (in Python 3).

I want to convert this HTML table into a list of lists so that I can print it in text mode with terminaltables. So, whenever an HTML cell spans 5 columns, I am trying to have some empty list cells fill up the rest of that row.

I think I may be overcomplicating something that could be done much more easily in (fluent) Python. Can anyone help?
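Concretely, the padding I have in mind could be sketched like this (a minimal, hypothetical `expand_row`; the width 5 matches my header row, and the `(text, colspan)` pairs are just an assumed intermediate form):

```python
def expand_row(cells, width=5):
    """Expand (text, colspan) pairs into a fixed-width list row.

    A spanning cell keeps its text and is followed by colspan - 1
    empty strings; short rows are topped up so the grid stays
    rectangular for terminaltables.
    """
    row = []
    for text, colspan in cells:
        row.append(text)
        row.extend([""] * (colspan - 1))
    row.extend([""] * (width - len(row)))
    return row[:width]

# A date "group" row whose single cell spans all 5 columns:
print(expand_row([("quinta-feira, 31 Março 2016", 5)]))
# A normal detail row with five single-column cells:
print(expand_row([("09:40", 1), ("Entrega conseguida", 1), ("-", 1),
                  ("4470 - MAIA", 1), ("DONIEL MARQUES", 1)]))
```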
My code at this point:
#!/usr/local/bin/python3
# encoding: utf-8
# just did a lot of experiments, so I will need to clean these imports! (some of them are related to the rest of the project anyway)
import sys
import os
import os.path
import csv
import re
from textwrap import fill as tw_fill
from random import randint
from datetime import datetime, timedelta
from copy import deepcopy
from platform import node
from colorclass import Color
from urllib3 import PoolManager
from bleach import clean
from bs4 import BeautifulSoup
from terminaltables import SingleTable
def obter_estado_detalhado(tracking_code):
    """ Verify detailed tracking status for CTT shipment
    Ex: obter_estado_detalhado("EA746000000PT")
    """
    ctt_url = "http://www.cttexpresso.pt/feapl_2/app/open/cttexpresso/objectSearch/objectSearch.jspx?lang=def&objects=" + tracking_code + "&showResults=true"
    estado = "- N/A -"
    dados_tracking = [[
        "Hora",
        "Estado",
        "Motivo",
        "Local",
        "Recetor"
    ]]
    # try:
    http = PoolManager()
    r = http.urlopen('GET', ctt_url, preload_content=False)
    soup = BeautifulSoup(r, "html.parser")
    records = dados_tracking
    table2 = soup.find_all('table')[1]
    l = 1
    c = 0
    for linha in table2.find_all('tr')[1:]:
        records.append([])
        for celula in linha.find_all('td')[1:]:
            txt = clean(celula.string, tags=[], strip=True).strip()
            records[l].append(txt)
            c += 1
        l += 1
    tabela = SingleTable(records)
    print(tabela.table)
    print(records)
    tabela = SingleTable(records)
    print(tabela.table)
    exit()  # This exit is only for testing purposes...

obter_estado_detalhado("EA746813946PT")
Sample HTML code (as in this link):
<table class="full-width">
<thead>
<tr>
<th>
Nº de Objeto
</th>
<th>
Produto
</th>
<th>
Data
</th>
<th>
Hora
</th>
<th>
Estado
</th>
<th>
Info
</th>
</tr>
</thead>
<tbody><tr>
<td>
EA746813813PT
</td>
<td>19</td>
<td>2016/03/31</td>
<td>09:40</td>
<td>
Objeto entregue
</td>
<td class="truncate">
<a id="detailsLinkShow_0" onclick="toggleObjectDetails('0', true);" class="hide">[+]Info</a>
<a id="detailsLinkHide_0" class="" onclick="toggleObjectDetails('0', false);">[-]Info</a>
</td>
</tr>
<tr></tr>
<tr id="details_0" class="">
<td colspan="6">
<div class="full-width-table-scroller"><table class="full-width">
<thead>
<tr>
<th>Hora</th>
<th>Estado</th>
<th>Motivo</th>
<th>Recetor</th>
</tr>
</thead>
<tbody><tr>
</tr>
<tr class="group">
<td colspan="5">quinta-feira, 31 Março 2016</td>
</tr><tr><td>09:40</td>
<td>Entrega conseguida</td>
<th>Local</th><td>-</td>
<td>4470 - MAIA</td>
<td>DONIEL MARQUES</td>
</tr>
<tr>
<td>08:32</td>
<td>Em distribuição</td>
<td>-</td>
<td>4470 - MAIA</td>
<td>-</td>
</tr>
<tr>
<td>08:29</td>
<td>Receção no local de entrega</td>
<td>-</td>
<td>4470 - MAIA</td>
<td>-</td>
</tr>
<tr>
<td>08:29</td>
<td>Receção nacional</td>
<td>-</td>
<td>4470 - MAIA</td>
<td>-</td>
</tr>
<tr>
<td>00:17</td>
<td>Envio</td>
<td>-</td>
<td>C. O. PERAFITA</td>
<td>-</td>
</tr>
<tr>
</tr><tr class="group">
<td colspan="5">quarta-feira, 30 Março 2016</td>
</tr>
<tr><td>23:40</td>
<td>Expedição nacional</td>
<td>-</td>
<td>C.O. PERAFITA (OPE)</td>
<td>-</td>
</tr>
<tr>
<td>20:39</td>
<td>Receção no local de entrega</td>
<td>-</td>
<td>C. O. PERAFITA</td>
<td>-</td>
</tr>
<tr>
<td>20:39</td>
<td>Receção nacional</td>
<td>-</td>
<td>C. O. PERAFITA</td>
<td>-</td>
</tr>
<tr>
<td>20:39</td>
<td>Aceitação</td>
<td>-</td>
<td>C. O. PERAFITA</td>
<td>-</td>
</tr>
</tbody></table></div>
</td>
</tr>
</tbody></table>
This matches the output for the main table:
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.cttexpresso.pt/feapl_2/app/open/cttexpresso/objectSearch/objectSearch.jspx?lang=def&objects=EA746813946PT&showResults=true").content
soup = BeautifulSoup(html)
# get the table using its id
rows = soup.select("#details_0")[0]
# get the header names and strip whitespace
cols = [th.text.strip() for th in rows.select("th")]
# extract all td's from each table row; the list comp keeps the data grouped row-wise
data = [[td.text.strip() for td in tr.select("td")] for tr in rows.select("tr")]
print(" ".join(cols))
for row in data:
    print(", ".join(row))
Output:
Hora Estado Motivo Local Recetor
terça-feira, 5 Abril 2016
07:58, Em distribuição, -, 4000 - PORTO, -
00:35, Envio, -, C. O. PERAFITA, -
00:20, Expedição nacional, -, C.O. PERAFITA (OPE), -
segunda-feira, 4 Abril 2016
21:45, Receção nacional, -, C. O. PERAFITA, -
21:45, Aceitação, -, C. O. PERAFITA, -
Website:

It was the parser; I think I tried all of them, but the only one that works is html5, using soup = BeautifulSoup(html, "html5")
Output:
Hora Estado Motivo Local Recetor
terça-feira, 5 Abril 2016
11:02, Entrega conseguida, -, 4000 - PORTO, CANDIDA VIEGAS
07:58, Em distribuição, -, 4000 - PORTO, -
00:35, Envio, -, C. O. PERAFITA, -
00:20, Expedição nacional, -, C.O. PERAFITA (OPE), -
segunda-feira, 4 Abril 2016
21:45, Receção no local de entrega, -, C. O. PERAFITA, -
21:45, Receção nacional, -, C. O. PERAFITA, -
21:45, Aceitação, -, C. O. PERAFITA, -
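For the colspan handling the original question asks about, here is a standard-library-only sketch (it does not use BS4; `html.parser` ships with Python, and `RowExtractor` is a made-up name): read each `<td>`'s colspan attribute and pad the row with empty cells as it is collected.

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect <td> text per <tr>, expanding colspan into empty cells."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of 5 strings
        self._row = None     # row under construction (None outside <tr>)
        self._span = 1       # colspan of the <td> being read
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._span = int(dict(attrs).get("colspan", 1))

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())
            # Pad the spanned columns so the row keeps its full width.
            self._row.extend([""] * (self._span - 1))

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            if self._row:            # skip the empty <tr></tr> rows
                self.rows.append(self._row)
            self._row = None

# Two rows modelled on the CTT details table: a spanning date row
# and a normal five-cell detail row.
html = ('<tr class="group"><td colspan="5">quinta-feira, 31 Março 2016</td></tr>'
        '<tr><td>09:40</td><td>Entrega conseguida</td><td>-</td>'
        '<td>4470 - MAIA</td><td>DONIEL MARQUES</td></tr>')
p = RowExtractor()
p.feed(html)
print(p.rows)
```

Every row comes out 5 cells wide, so the result can go straight into terminaltables. This sketch glosses over completely empty `<td></td>` cells (in the real page a cell always carries at least "-"), and it ignores the stray `<th>Local</th>` inside the first detail row, which a robust version would need to treat as a cell as well.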