Parsing and processing an HTML table for Azure Table Storage
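I have an HTML output file generated by a reporting system. I want to use Python to push the data in the HTML to Azure Table storage. I'm relatively new to Python and not sure how to do this properly.
The HTML rows look like this: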
<tr>
<td>Data Type</td>
<td>RandomID</td>
<td>Random Title</td>
<td>Foo</td>
<td>Bar</td>
<td></td>
<td>Random Data</td>
<td>Another random data</td>
</tr>
Here is the code:
f=codecs.open("generatedReport.html", 'r')
html_data = f.read()
parsedHtml = BeautifulSoup(html_data)
htmldata_parsed = parsedHtml.find("table", {"id": "issuetable"})
#List for Data/Value of the Entity
table_data = [[cell.text for cell in row("td")]
              for row in htmldata_parsed("tr")]
#List for Header/Keys of the Entity
table_header = [[cell.text for cell in row("th")]
                for row in htmldata_parsed("tr")]
for i in table_data:
    indI = 0
    id = uuid.uuid1()
    task = Entity()
    task.PartitionKey = "PartKey"
    task.RowKey = id.hex
    for c in table_header[0]:
        indC = 0
        keyName = c.replace('\n','').replace('\t','').replace('\r','').strip()
        keyValue = i[indC:(indC+1)] #this is where I think the issue is.
        task[keyName] = keyValue
        indC = indC+1
    indI = indI+1
    print(task)
Output:
{'PartitionKey': 'PartKey', 'RowKey': '0d1b5a3a8b4f11e99a87a44cc87947c7', 'Type': [' Data Type\n'], 'ID': [' Data Type\n'], 'Title': [' Data Type\n'], 'Column1': [' Data Type\n'], 'Column2': [' Data Type\n'], 'Column2': [' Data Type\n'], 'Column3': [' Data Type\n'], 'Column4': [' Data Type\n']}
Expected output:
{'PartitionKey': 'PartKey', 'RowKey': '0d1b5a3a8b4f11e99a87a44cc87947c7', 'Type': 'Data Type', 'ID': 'RandomID', 'Title': 'Random Title', 'Column1': 'Foo', 'Column2': 'Bar', 'Column2': '', 'Column3': 'Random Data', 'Column4': 'Another random data'}
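The output above looks consistent with two things in the inner loop: `indC` is set back to 0 on every pass, so the same cell is read each time, and `i[indC:(indC+1)]` is a slice, which yields a one-element list rather than the cell text. Here is a minimal sketch of that behaviour next to a position-based pairing with `zip`. The `headers` and `row` lists are made-up stand-ins for one parsed header row and one parsed data row, not your real data.

```python
# Hypothetical sample data standing in for one parsed header row and one data row.
headers = ["Type", "ID", "Title"]
row = [" Data Type\n", " RandomID\n", " Random Title\n"]

# Mirrors the original inner loop: the index is reset on every pass and a
# slice returns a list, so every key ends up holding [first cell].
buggy = {}
for name in headers:
    index = 0
    buggy[name] = row[index:index + 1]
    index = index + 1

# Pairing headers and cells by position, then stripping whitespace,
# gives one plain string per key.
fixed = {name.strip(): cell.strip() for name, cell in zip(headers, row)}

print(buggy)
print(fixed)
```

Moving `indC = 0` above the inner loop and indexing with `i[indC]` would also work; `zip` just removes the manual counter altogether.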
For reference, based on my experience, I rewrote your code in Python 3 to produce your expected output, using the input HTML and the sample code below.
generatedReport.html
<html>
<body>
<table id="issuetable">
<tr>
<th>Type</th>
<th>ID</th>
<th>Title</th>
<th>Column1</th>
<th>Column2</th>
<th>Column3</th>
<th>Column4</th>
<th>Column5</th>
</tr>
<tr>
<td>Data Type</td>
<td>RandomID</td>
<td>Random Title</td>
<td>Foo</td>
<td>Bar</td>
<td></td>
<td>Random Data</td>
<td>Another random data</td>
</tr>
<tr>
<td>Data Type1</td>
<td>RandomID1</td>
<td>Random Title1</td>
<td>Foo1</td>
<td>Bar1</td>
<td></td>
<td>Random Data1</td>
<td>Another random data1</td>
</tr>
</table>
</body>
</html>
Here is my sample code.
from bs4 import BeautifulSoup
import uuid
import json

# Read the report; the context manager closes the file afterwards
with open("generatedReport.html", 'r') as f:
    html_data = f.read()
# Name the parser explicitly so BeautifulSoup does not have to guess one
parsedHtml = BeautifulSoup(html_data, "html.parser")
htmldata_parsed = parsedHtml.find("table", {"id": "issuetable"})
#List for Data/Value of the Entity (rows without <td> cells, such as the header row, are skipped)
table_data = [[cell.text.strip() for cell in row("td")] for row in htmldata_parsed("tr") if row("td")]
#List for Header/Keys of the Entity
table_header = [cell.text.strip() for row in htmldata_parsed("tr") for cell in row("th")]
# combine the table header with each row data to generate task as dict
tasks = [dict(zip(table_header, row)) for row in table_data]
#print(tasks)
# Add the partitionKey and rowKey into each task dict
for task in tasks:
    task.update({'PartitionKey': 'PartKey', 'RowKey': uuid.uuid1().hex})
#print(tasks)
for task in tasks:
    json_str = json.dumps(task)
    print(json_str)
The output is as follows:
{"Type": "Data Type", "ID": "RandomID", "Title": "Random Title", "Column1": "Foo", "Column2": "Bar", "Column3": "", "Column4": "Random Data", "Column5": "Another random data", "PartitionKey": "PartKey", "RowKey": "56ac9b8c8b6411e98740f48e38aa7f99"}
{"Type": "Data Type1", "ID": "RandomID1", "Title": "Random Title1", "Column1": "Foo1", "Column2": "Bar1", "Column3": "", "Column4": "Random Data1", "Column5": "Another random data1", "PartitionKey": "PartKey", "RowKey": "56ad37688b6411e99bd7f48e38aa7f99"}
The order of the JSON keys does not matter, so there is no need to worry about it.
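To get from these dicts to the table itself, something along the following lines might work. This is only a sketch and it makes an assumption: it uses the `azure-data-tables` package (`TableServiceClient` and its `create_entity` call) rather than whichever SDK your `Entity()` class comes from. The helper names `upload_tasks` and `make_table_client`, the connection string, and the table name are all placeholders I made up for illustration.

```python
def upload_tasks(table_client, tasks):
    """Insert each task dict as one entity and return how many were sent.

    table_client is assumed to expose create_entity(entity=...), as the
    TableClient class in the azure-data-tables package does.
    Each dict must already carry 'PartitionKey' and 'RowKey'.
    """
    count = 0
    for task in tasks:
        table_client.create_entity(entity=task)
        count += 1
    return count


def make_table_client(connection_string, table_name):
    """Hypothetical helper: build a table client from a connection string."""
    # Imported lazily so upload_tasks can be used without the package installed.
    from azure.data.tables import TableServiceClient

    service = TableServiceClient.from_connection_string(conn_str=connection_string)
    # Returns a client for the table, creating the table first if it is missing.
    return service.create_table_if_not_exists(table_name=table_name)
```

With a real storage account you would call `upload_tasks(make_table_client(your_connection_string, "yourtable"), tasks)`. Keeping the client as a parameter also makes it possible to try the loop against a stand-in object before touching the real table.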