Extract GPS coordinates from .docx file with Python
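I have a tedious task to do and need some help from Python. Please see this Word document.
I need to extract the text and GPS coordinates from each row. So far there are over 100 coordinates across 10 docx files. My "hefty" Python knowledge got me this far: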
from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1 Category I/1 Category I.docx")
table = main_file.tables[1]  # this is same for every document

data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        # The first row holds the column headers.
        keys = tuple(text)
        continue
    row_data = tuple(text)
    data.append(row_data)

# Reference IDs look like "C1-20701-17-1".
regexReference = re.compile(r"(C.-)\w+")
colReference = [item[1] for item in data]
listReference = filter(regexReference.match, colReference)
for i in listReference:
    print i.encode('UTF-8')
With this I can print the 16 reference IDs from the 2nd column. Please guide me on how to print something like this:
C1-20701-17-1
some site, some region
The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires
some repair/maintenance works including electrical wiring and electrical
lights and appliances like ceiling fans supplies. Detail specification of
the works are attached
x = 91°38'28.2"E
y = 22°40'34.3"N
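These XY positions and descriptions will later be used to create KML files to be attached to each document. I would like one variable for each part of the output above (ref id, location, description, x and y) so that I can automate that step as well.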
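Since KML stores coordinates as signed decimal degrees rather than degrees/minutes/seconds, the x and y strings will probably need converting at some point. Below is a minimal sketch of that conversion, written for Python 3. The name `dms_to_decimal` and the pattern `DMS_PATTERN` are hypothetical, and the pattern assumes every coordinate is formatted like the single sample above (`22°40'34.3"N`); other notations in the documents would need their own handling.

```python
import re

# Assumed format: degrees°minutes'seconds"hemisphere, e.g. 22°40'34.3"N.
# This is inferred from one sample only and may not cover every document.
DMS_PATTERN = re.compile(r"""(\d+)°(\d+)'([\d.]+)"([NSEW])""")


def dms_to_decimal(dms):
    """Convert a degrees-minutes-seconds string to signed decimal degrees."""
    match = DMS_PATTERN.search(dms)
    if match is None:
        raise ValueError("not a DMS coordinate: %r" % (dms,))
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60.0 + float(seconds) / 3600.0
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value


x = dms_to_decimal('91°38\'28.2"E')  # longitude
y = dms_to_decimal('22°40\'34.3"N')  # latitude
print("x = %.7f, y = %.7f" % (x, y))
```

Raising `ValueError` on a non-matching string is a design choice so that unexpected formats are noticed instead of silently skipped.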
I don't know whether this will work if some files follow a different pattern (P.S. I'm using Python 2.7.11):
# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

# Python 2 only: make utf8 the default encoding for implicit conversions.
reload(sys)
sys.setdefaultencoding('utf8')

for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is same for every document
            data = []
            keys = None
            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)
                if i == 0:
                    # The first row holds the column headers.
                    keys = tuple(text)
                    continue
                row_data = tuple(text)
                data.append(row_data)

            regexReference = re.compile("(C.-[0-9-]+)")
            # Group 1 is the northing (latitude), group 4 the easting (longitude).
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')
            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
                if matchCoordinate:
                    # As in the question: x is the E value, y is the N value.
                    tmp['x'] = matchCoordinate.group(4)
                    tmp['y'] = matchCoordinate.group(1)
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)

            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print

# Output:
# --------------------------------
# y = 22°40'34.3"N
# x = 91°38'28.2"E
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
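For the KML step mentioned in the question, one possible approach is to build the file with the standard library's `xml.etree.ElementTree`. The sketch below is written for Python 3 and rests on assumptions: `build_kml` is a hypothetical helper, and the `lon` and `lat` keys are assumed to hold coordinates already converted to decimal degrees (the script above stores the raw degree/minute/second strings under `x` and `y`).

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(records):
    """Return a KML document as a string, with one Placemark per record.

    Each record is a dict with 'reference', 'description', 'lon' and 'lat'.
    'lon' and 'lat' are hypothetical keys holding decimal degrees.
    """
    # Serialize KML elements without a namespace prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    document = ET.SubElement(kml, "{%s}Document" % KML_NS)
    for record in records:
        placemark = ET.SubElement(document, "{%s}Placemark" % KML_NS)
        ET.SubElement(placemark, "{%s}name" % KML_NS).text = record["reference"]
        ET.SubElement(placemark, "{%s}description" % KML_NS).text = record["description"]
        point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
        # KML expects "longitude,latitude", in that order.
        coordinates = ET.SubElement(point, "{%s}coordinates" % KML_NS)
        coordinates.text = "%.7f,%.7f" % (record["lon"], record["lat"])
    return ET.tostring(kml, encoding="unicode")


sample = [{
    "reference": "C1-20701-17-1",
    "description": "The existing CMC Office at Bariyodhala requires some repair/maintenance works.",
    "lon": 91.6411667,
    "lat": 22.6761944,
}]
print(build_kml(sample))
```

Building the tree with `ElementTree` rather than string formatting means characters such as `&` or `<` in a description are escaped automatically.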