Parsing and processing an HTML table for Azure Table Storage

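I have an HTML output file generated by a reporting system. I want to use Python to push the data in the HTML to Azure Table Storage. I'm relatively new to Python and not sure how to do this correctly.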

An HTML row looks like this:

<tr>
    <td>Data Type</td>
    <td>RandomID</td>
    <td>Random Title</td>
    <td>Foo</td>
    <td>Bar</td>
    <td></td>
    <td>Random Data</td>
    <td>Another random data</td>
</tr>

Here is the code:

        f=codecs.open("generatedReport.html", 'r')
        html_data = f.read()
        parsedHtml = BeautifulSoup(html_data)
        htmldata_parsed = parsedHtml.find("table", {"id": "issuetable"})

        #List for Data/Value of the Entity
        table_data = [[cell.text for cell in row("td")]
                     for row in htmldata_parsed("tr")]

        #List for Header/Keys of the Entity
        table_header = [[cell.text for cell in row("th")]
                     for row in htmldata_parsed("tr")]

        for i in table_data:
            indI = 0
            id = uuid.uuid1() 
            task = Entity()
            task.PartitionKey = "PartKey"
            task.RowKey = id.hex
            for c in table_header[0]:
                indC = 0 
                keyName = c.replace('\n','').replace('\t','').replace('\r','').strip()
                keyValue = i[indC:(indC+1)] #this is where I think the issue is.
                task[keyName] = keyValue
                indC = indC+1
            indI = indI+1
            print(task)

Output:

  {'PartitionKey': 'PartKey', 'RowKey': '0d1b5a3a8b4f11e99a87a44cc87947c7', 'Type': ['    Data Type\n'], 'ID': ['    Data Type\n'], 'Title': ['    Data Type\n'], 'Column1': ['    Data Type\n'], 'Column2': ['    Data Type\n'], 'Column2': ['    Data Type\n'], 'Column3': ['    Data Type\n'], 'Column4': ['    Data Type\n']}

Expected output:

{'PartitionKey': 'PartKey', 'RowKey': '0d1b5a3a8b4f11e99a87a44cc87947c7', 'Type': 'Data Type', 'ID': 'RandomID', 'Title': 'Random Title', 'Column1': 'Foo', 'Column2': 'Bar', 'Column2': '', 'Column3': 'Random Data', 'Column4': 'Another random data'}

For reference, I rewrote your code in Python 3 so that it produces your expected output. The input HTML I used and my sample code are given below.
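
As far as I can tell, the repeated value in your output comes from two things in the inner loop: `indC = 0` sits inside the loop, so the index is reset on every pass and always points at the first cell, and `i[indC:(indC+1)]` is a slice, which returns a one-element list instead of the string itself. Here is a minimal sketch of the loop with both points fixed. It uses hard-coded sample lists in place of the parsed HTML, and the `build_entity` helper and the plain dict standing in for `Entity()` are my own assumptions for illustration.

```python
import uuid

# Sample data shaped like the lists in the question (values are illustrative).
table_header = ["Type\n", "ID\n", "Title\n"]
table_data = [["    Data Type\n", "RandomID", "Random Title"]]


def build_entity(header, row, partition_key="PartKey"):
    """Pair each header with the cell at the same position in the row."""
    entity = {"PartitionKey": partition_key, "RowKey": uuid.uuid1().hex}
    # zip walks both lists together, so no manual index is needed,
    # and each value is the cell string itself rather than a slice.
    for key, value in zip(header, row):
        entity[key.strip()] = value.strip()
    return entity


entities = [build_entity(table_header, row) for row in table_data]
print(entities[0])
```

Note that the header row has no `td` cells, so in your version it shows up in `table_data` as an empty list and should be skipped.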

generatedReport.html

<html>
<body>
<table id="issuetable">
<tr>
    <th>Type</th>
    <th>ID</th>
    <th>Title</th>
    <th>Column1</th>
    <th>Column2</th>
    <th>Column3</th>
    <th>Column4</th>
    <th>Column5</th>
</tr>
<tr>
    <td>Data Type</td>
    <td>RandomID</td>
    <td>Random Title</td>
    <td>Foo</td>
    <td>Bar</td>
    <td></td>
    <td>Random Data</td>
    <td>Another random data</td>
</tr>
<tr>
    <td>Data Type1</td>
    <td>RandomID1</td>
    <td>Random Title1</td>
    <td>Foo1</td>
    <td>Bar1</td>
    <td></td>
    <td>Random Data1</td>
    <td>Another random data1</td>
</tr>
</table>
</body>
</html>

Here is my sample code.

from bs4 import BeautifulSoup
import uuid
import json

# Read the report; the with-block closes the file afterwards
with open("generatedReport.html", 'r') as f:
    html_data = f.read()

# Name the parser explicitly so BeautifulSoup does not have to guess one
parsedHtml = BeautifulSoup(html_data, "html.parser")
htmldata_parsed = parsedHtml.find("table", {"id": "issuetable"})

# List for Data/Value of the Entity (rows without <td> cells, such as the header row, are skipped)
table_data = [[cell.text.strip() for cell in row("td")] for row in htmldata_parsed("tr") if row("td")]

# List for Header/Keys of the Entity
table_header = [cell.text.strip() for row in htmldata_parsed("tr") for cell in row("th")]

# Combine the table header with each row's data to build each task as a dict
tasks = [dict(zip(table_header, row)) for row in table_data]

# Add the PartitionKey and RowKey to each task dict
for task in tasks:
    task.update({'PartitionKey': 'PartKey', 'RowKey': uuid.uuid1().hex})

# Print each task as a JSON string
for task in tasks:
    print(json.dumps(task))

The output is as follows:

{"Type": "Data Type", "ID": "RandomID", "Title": "Random Title", "Column1": "Foo", "Column2": "Bar", "Column3": "", "Column4": "Random Data", "Column5": "Another random data", "PartitionKey": "PartKey", "RowKey": "56ac9b8c8b6411e98740f48e38aa7f99"}
{"Type": "Data Type1", "ID": "RandomID1", "Title": "Random Title1", "Column1": "Foo1", "Column2": "Bar1", "Column3": "", "Column4": "Random Data1", "Column5": "Another random data1", "PartitionKey": "PartKey", "RowKey": "56ad37688b6411e99bd7f48e38aa7f99"}

The order of the JSON keys does not matter, so there is no need to worry about it.
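
To actually push the task dicts to Azure Table Storage, something along these lines may work. This is only a sketch under assumptions: it uses the newer `azure-data-tables` package (`TableClient.from_connection_string` and `upsert_entity`) rather than the `Entity()` class from your code, the table is assumed to exist already, and `upload_tasks` and `missing_keys` are hypothetical helper names of mine.

```python
REQUIRED_KEYS = ("PartitionKey", "RowKey")


def missing_keys(task):
    """Return the required Table Storage keys that a task dict lacks."""
    return [key for key in REQUIRED_KEYS if not task.get(key)]


def upload_tasks(tasks, connection_string, table_name):
    """Upsert each task dict as one entity; returns the number sent."""
    # Imported lazily so the rest of the script runs without the Azure SDK.
    # Assumption: the azure-data-tables package is installed.
    from azure.data.tables import TableClient

    client = TableClient.from_connection_string(
        connection_string, table_name=table_name
    )
    for task in tasks:
        problems = missing_keys(task)
        if problems:
            raise ValueError(f"task is missing {problems}")
        # upsert_entity inserts the entity, or updates it if the
        # PartitionKey/RowKey pair already exists in the table.
        client.upsert_entity(task)
    return len(tasks)
```

You would call it as `upload_tasks(tasks, "<your connection string>", "<your table name>")`, filling in both placeholders with your own values.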