Parsing a table from a .docx file
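I want to use Python and python-docx to parse a table from a .docx file into some useful data structure.
In my case, the .docx file contains only one table. I uploaded it so you can have a look. Here is a screenshot:
You can use the snippet below to parse your document into a list in which each row is a dictionary mapping the table header values to that row's column values.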
from docx.api import Document

# Load the first table from your document. In your example file,
# there is only one table, so I just grab the first one.
document = Document('Books.docx')
table = document.tables[0]

# Data will be a list of rows represented as dictionaries
# containing each row's data.
data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # Establish the mapping based on the first row
    # headers; these will become the keys of our dictionary
    if i == 0:
        keys = tuple(text)
        continue

    # Construct a dictionary for this row, mapping
    # keys to values for this row
    row_data = dict(zip(keys, text))
    data.append(row_data)
This will give you:
data = [
{u'Pub.': u'Penguin Books',
u'Auther': u'Edward de BONO',
u'Sr. No.': u'1',
u'Name of Book': u'Six Thinking Hats'
},
...
]
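If you need this in more than one place, the same logic can be wrapped in a small helper. This is only a sketch: the names `rows_to_dicts` and `docx_table_to_dicts` are made up for illustration, and it assumes the first row is the header row. The header handling is split out so it can be tried on plain lists without a .docx file.

```python
def rows_to_dicts(rows):
    """Map each data row to a dict keyed by the first (header) row.

    `rows` is any iterable of rows, each row a sequence of cell strings.
    """
    rows = iter(rows)
    try:
        keys = tuple(next(rows))  # assumption: the first row holds the headers
    except StopIteration:
        return []  # empty table
    return [dict(zip(keys, row)) for row in rows]


def docx_table_to_dicts(path, index=0):
    """Hypothetical wrapper: read table number `index` from the .docx at `path`."""
    # Imported here so rows_to_dicts can be used without python-docx installed.
    from docx import Document

    table = Document(path).tables[index]
    return rows_to_dicts([cell.text for cell in row.cells] for row in table.rows)


# Stand-in for the cell text of the table in the question.
sample = [
    ["Sr. No.", "Name of Book", "Auther", "Pub."],
    ["1", "Six Thinking Hats", "Edward de BONO", "Penguin Books"],
]
records = rows_to_dicts(sample)
```

With the real file you would call `docx_table_to_dicts('Books.docx')` instead of passing `sample`.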
If you just want a tuple for each row, set row_data to the tuple of text instead of building a dictionary. So inside the loop, instead of constructing the dict, do:
# Construct a tuple for this row
row_data = tuple(text)
data.append(row_data)
Now data will hold something like this:
data = [
(u'1',
u'Six Thinking Hats',
u'Edward de BONO',
u'Penguin Books'
),
...
]
You can then skip constructing keys, obviously (but still skip the first row!).
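For the tuple variant, a rough sketch of the loop with the header row skipped could look like the following. The function name `rows_to_tuples` and its `skip_header` flag are my own invention, and it takes plain rows of cell strings (for example `[cell.text for cell in row.cells]` for each row in `table.rows`) rather than a python-docx table, so it can be run on its own.

```python
def rows_to_tuples(rows, skip_header=True):
    """Return one tuple of cell strings per row, optionally dropping the first row."""
    rows = iter(rows)
    if skip_header:
        next(rows, None)  # discard the header row; does nothing on an empty table
    return [tuple(row) for row in rows]


# Stand-in for the cell text of the table in the question.
sample = [
    ["Sr. No.", "Name of Book", "Auther", "Pub."],
    ["1", "Six Thinking Hats", "Edward de BONO", "Penguin Books"],
]
records = rows_to_tuples(sample)
```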