Regular expressions returning partial matches
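I have a KML file containing a list of destinations and their coordinates. There are a little over 40 destinations in the file. I'm trying to parse the coordinates out of it. When you look at the file you see <coordinates>...</coordinates>, so finding them isn't the hard part, but I can't seem to get the full result. What I mean is that it cuts off the "-94." (or whatever negative float is at the start) and prints the rest.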
#!/usr/bin/python3.5
import re

def main():
    results = []
    with open("file.kml","r") as f:
        contents = f.readlines()
        if f.mode == 'r':
            print("reading file...")
            for line in contents:
                coords_match = re.search(r"(<coordinates>)[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)
                if coords_match:
                    coords_matchh = coords_match.group()
                    print(coords_matchh)
Here are some of the results I get:
3502969,38.8555497
7662462,38.8583916
6280323,38.8866337
3655059,39.3983001
This is the format in the file, in case it makes a difference:
<coordinates>
-94.5944738,39.031411,0
</coordinates>
If I modify this line and remove the (<coordinates>) group from the start:
coords_match = re.search(r"[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)
these are the results I get:
-94.7662462
-94.6280323
-94.3655059
This is essentially the result I want:
-94.7662462,38.8583916
-94.6280323,38.8866337
-94.3655059,39.3983001
While using an actual parser is the way to go, as @Kendas suggested in the comments, you could try findall instead of search:
>>> import re
>>> s = """<coordinates>
... -94.5944738,39.031411,0
... </coordinates>"""
>>> re.findall(r'[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)', s)
['-94.5944738', '39.031411']
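To see why search returns only part of each line, here is a minimal sketch. It runs the question's original pattern against a single coordinate line, then tries a simpler pattern that returns the longitude/latitude pair in one match. The name LON_LAT and its pattern are my own illustration, not from the question, and the pattern assumes both values always contain a decimal point.

```python
import re

# One coordinate line as it appears in the question's KML file.
line = "-94.5944738,39.031411,0"

# The question's original pattern. Its first alternative needs a literal
# "<coordinates>" on the same line, so it cannot match here. The second
# alternative starts with \d+ and allows neither a sign nor a decimal
# point before the comma, so the earliest place it can match is the "5"
# right after "-94.".
original = r"(<coordinates>)[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)"
partial = re.search(original, line).group()

# Hypothetical replacement: signed decimal, comma, signed decimal.
LON_LAT = re.compile(r"[+-]?\d+\.\d+,[+-]?\d+\.\d+")
full = LON_LAT.search(line).group()

print(partial)  # 5944738,39.031411
print(full)     # -94.5944738,39.031411
```

Because re.search stops at the first position where any alternative matches, a single pattern describing the whole pair is easier to reason about than several alternatives that each describe one piece.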
You can also use BeautifulSoup to get the coordinates, since it is an XML/HTML parser.
from bs4 import BeautifulSoup
text = """<coordinates>
-94.5944738,39.031411,0
</coordinates>
<coordinates>
-94.59434738,39.032311,0
</coordinates>
<coordinates>
-94.523444738,39.0342411,0
</coordinates>"""
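soup = BeautifulSoup(text, "lxml")
# find_all is the current name for the older findAll alias
for tag in soup.find_all('coordinates'):
    # Strip surrounding whitespace, then drop the trailing ",0" (altitude).
    # The [:-2] slice assumes the altitude is a single character.
    print(tag.text.strip()[:-2])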
Output:
-94.5944738,39.031411
-94.59434738,39.032311
-94.523444738,39.0342411
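If you would rather not install anything, the standard library's xml.etree.ElementTree can do the same job. The sketch below is only an illustration: the sample document is made up, the function name lon_lat_pairs is hypothetical, and it assumes the file declares the KML 2.2 namespace (http://www.opengis.net/kml/2.2), which you should check against your own file.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal KML document; a real file carries more elements.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <Point>
        <coordinates>
          -94.5944738,39.031411,0
        </coordinates>
      </Point>
    </Placemark>
    <Placemark>
      <Point>
        <coordinates>
          -94.59434738,39.032311,0
        </coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>"""

# Assumption: the file uses the KML 2.2 namespace.
NS = "{http://www.opengis.net/kml/2.2}"


def lon_lat_pairs(kml_text):
    """Return (lon, lat) float tuples from every <coordinates> element."""
    root = ET.fromstring(kml_text)
    pairs = []
    for elem in root.iter(NS + "coordinates"):
        # An element may hold several whitespace-separated tuples.
        for chunk in (elem.text or "").split():
            lon, lat = chunk.split(",")[:2]
            pairs.append((float(lon), float(lat)))
    return pairs


print(lon_lat_pairs(KML))
```

Splitting on whitespace first means the same function would also cope with elements that list more than one tuple, such as a path.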
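If all you want is to extract simple, clearly delimited data, an XML parser is a bit overkill.
The main thing is to use a simpler regular expression and to search the whole file at once. Focus on capturing everything between the tags: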
import re

with open("file.kml","r") as f:
    contents = f.read()
# re.DOTALL lets "." match newlines, so a match can span several lines
coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', contents, re.DOTALL)
This will return a list of matches. Each item in the list will look like this:
'\n -94.5944738,39.031411,0\n '
So for each item, you need to:
- strip the surrounding whitespace
- split from the right on the last ","
- discard the second result.
So you do this:
results = [c.strip().rsplit(',', 1)[0] for c in coords_match]
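This gives you a list of the strings you want.
If you actually want to work with the numbers, I would convert them to floats (using a nested comprehension):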
results = [tuple(float(f) for f in c.strip().split(',')[:2]) for c in coords_match]
This will give you a list of 2-tuples of float.
Demo in IPython:
In [1]: import re
In [2]: text = """<coordinates>
...: -94.5944738,39.031411,0
...: </coordinates>
...: <coordinates>
...: -94.59434738,39.032311,0
...: </coordinates>
...: <coordinates>
...: -94.523444738,39.0342411,0
...: </coordinates>"""
In [3]: coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', text, re.DOTALL)
Out[3]:
['\n -94.5944738,39.031411,0\n ',
'\n -94.59434738,39.032311,0\n ',
'\n -94.523444738,39.0342411,0\n ']
In [4]: results1 = [c.strip().rsplit(',', 1)[0] for c in coords_match]
Out[4]: ['-94.5944738,39.031411', '-94.59434738,39.032311', '-94.523444738,39.0342411']
In [5]: results2 = [tuple(float(f) for f in c.strip().split(',')[:2]) for c in coords_match]
Out[5]:
[(-94.5944738, 39.031411),
(-94.59434738, 39.032311),
(-94.523444738, 39.0342411)]
Edit: If you want to save the data as JSON, it is best to use the float conversion, because that can be converted to JSON directly:
In [6]: import json
In [7]: print(json.dumps(results2, indent=2))
[
[
-94.5944738,
39.031411
],
[
-94.59434738,
39.032311
],
[
-94.523444738,
39.0342411
]
]
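To actually write the result to disk and read it back, something along these lines should work. This is a sketch only: the file name coords.json and the temporary directory are placeholders of mine. One thing to keep in mind is that JSON has no tuple type, so the pairs come back as lists.

```python
import json
import os
import tempfile

# Same shape as results2 in the demo above.
results2 = [(-94.5944738, 39.031411), (-94.59434738, 39.032311)]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "coords.json")

with open(path, "w") as f:
    json.dump(results2, f, indent=2)

with open(path) as f:
    loaded = json.load(f)

# Each pair is read back as a list; convert if you need tuples again.
restored = [tuple(pair) for pair in loaded]

print(loaded[0])             # [-94.5944738, 39.031411]
print(restored == results2)  # True
```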