Regular expressions returning partial matches

Question

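I have a KML file containing a list of destinations and their coordinates. There are a little over 40 destinations in the file. I am trying to parse the coordinates out of it. When you look at the file you will see <coordinates>...</coordinates>, so finding them is not the hard part, but I cannot get the full result. What I mean is that it cuts off the "-94." (or any negative float) at the start and prints only the rest.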

#!/usr/bin/python3.5

import re

def main():

    results = []
    with open("file.kml","r") as f:
        contents = f.readlines()

    if f.mode == 'r':
        print("reading file...")
        for line in contents:
            coords_match = re.search(r"(<coordinates>)[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)
            if coords_match:
                coords_matchh = coords_match.group()
                print(coords_matchh)

Here are some of the results I get:

3502969,38.8555497
7662462,38.8583916
6280323,38.8866337
3655059,39.3983001

This is the format in the file, in case it makes a difference:

<coordinates>
  -94.5944738,39.031411,0
</coordinates>

If I modify this line and remove the leading (<coordinates>) group:

coords_match = re.search(r"[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)",line)

This is the result I get:

-94.7662462
-94.6280323
-94.3655059

This is basically the result I want:

-94.7662462,38.8583916
-94.6280323,38.8866337
-94.3655059,39.3983001

Answer 1

Using an actual parser is the way to go, as @Kendas suggested in the comments, but you could also try findall instead of search:

>>> import re
>>> s = """<coordinates>
...   -94.5944738,39.031411,0
... </coordinates>"""
>>> re.findall(r'[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)', s)
['-94.5944738', '39.031411']
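To make the difference concrete, here is a minimal sketch that runs the same pattern through both functions on the sample from the question. The variable names are only illustrative, and the final join assumes the first two matches are always longitude and latitude.

```python
import re

s = """<coordinates>
  -94.5944738,39.031411,0
</coordinates>"""

pattern = r'[+-]?\d+\.\d+|\d+\,\-?\d+\.\d+|\d+(?=</coordinates)'

# search() stops at the first match, so only the longitude comes back.
first_only = re.search(pattern, s).group()

# findall() keeps scanning and returns every non-overlapping match.
all_matches = re.findall(pattern, s)

# Joining the two floats rebuilds the "lon,lat" string the question asks for.
lon_lat = ",".join(all_matches[:2])

print(first_only)
print(all_matches)
print(lon_lat)
```

Note that this works on the whole text at once. When the file is read line by line, as in the question, the tag and the numbers sit on different lines.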

Answer 2

You can also use BeautifulSoup to get the coordinates, since this is really an XML/HTML parsing task.

from bs4 import BeautifulSoup

text = """<coordinates>
              -94.5944738,39.031411,0
            </coordinates>
            <coordinates>
              -94.59434738,39.032311,0
            </coordinates>
            <coordinates>
              -94.523444738,39.0342411,0
            </coordinates>"""
soup = BeautifulSoup(text, "lxml")
coordinates = soup.find_all('coordinates')

for tag in coordinates:
    # Drop the trailing altitude field (",0") and keep "lon,lat"
    print(tag.text.strip().rsplit(',', 1)[0])

Output:

-94.5944738,39.031411
-94.59434738,39.032311
-94.523444738,39.0342411
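If installing third-party packages is not an option, the same idea can probably be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is a swapped-in alternative, not what the answer above uses. The sample document and the helper name lon_lat_strings are hypothetical, and the sketch assumes every <coordinates> element holds a single "lon,lat,alt" triple.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real KML files usually declare this default namespace.
sample = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark><Point><coordinates>
    -94.5944738,39.031411,0
  </coordinates></Point></Placemark>
  <Placemark><Point><coordinates>
    -94.59434738,39.032311,0
  </coordinates></Point></Placemark>
</kml>"""


def lon_lat_strings(kml_text):
    """Return a "lon,lat" string for every <coordinates> element."""
    root = ET.fromstring(kml_text)
    found = []
    for elem in root.iter():
        # Tags look like "{namespace}coordinates"; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "coordinates":
            lon, lat = elem.text.strip().split(",")[:2]
            found.append(lon + "," + lat)
    return found


print(lon_lat_strings(sample))
```

Comparing only the local tag name avoids hard-coding the namespace, which may differ between KML versions.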

Answer 3

If you only want to extract simple, clearly delimited data, an XML parser is somewhat overkill.

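The main thing is to use a simpler regular expression and to search the whole file at once. Focus on capturing everything between the tags: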

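import re

# Read the whole file as one string so a match can span several lines
with open("file.kml", "r") as f:
    contents = f.read()
# re.DOTALL lets "." match newlines; ".*?" stops at the nearest closing tag
coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', contents, re.DOTALL)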

This returns a list of matches. Each item in the list will look like this:

'\n  -94.5944738,39.031411,0\n  '

So for each item, you need to:

strip the surrounding whitespace
split from the right on the last ","
discard the second part of the result.

So you do this:

results = [c.strip().rsplit(',', 1)[0] for c in coords_match]

This gives you a list of the strings you want.
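For readers who find the one-liner dense, the same three steps can be traced on a single item. This is only an unrolled sketch of the comprehension, using the sample item shown earlier. The names stripped, head and altitude are illustrative.

```python
item = '\n  -94.5944738,39.031411,0\n  '

# Step 1: remove the surrounding whitespace and newlines.
stripped = item.strip()

# Steps 2 and 3: split once from the right, keep the first part,
# and set aside the trailing altitude field.
head, altitude = stripped.rsplit(',', 1)

print(stripped)
print(head)
```

Because rsplit is limited to one split, the comma between longitude and latitude is left untouched.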

If you actually want to work with the numbers, I would convert them to floats (using a nested comprehension):

results = [tuple(float(f) for f in  c.strip().split(',')[:2]) for c in coords_match]

This gives you a list of 2-tuples of floats.

Demonstration in IPython:

In [1]: import re                                                                                        

In [2]: text = """<coordinates> 
   ...:               -94.5944738,39.031411,0 
   ...:             </coordinates> 
   ...:             <coordinates> 
   ...:               -94.59434738,39.032311,0 
   ...:             </coordinates> 
   ...:             <coordinates> 
   ...:               -94.523444738,39.0342411,0 
   ...:             </coordinates>"""                                                                    

In [3]: coords_match = re.findall(r'<coordinates>(.*?)</coordinates>', text, re.DOTALL)                  
Out[3]: 
['\n              -94.5944738,39.031411,0\n            ',
 '\n              -94.59434738,39.032311,0\n            ',
 '\n              -94.523444738,39.0342411,0\n            ']

In [4]: results1 = [c.strip().rsplit(',', 1)[0] for c in coords_match]                                   
Out[4]: ['-94.5944738,39.031411', '-94.59434738,39.032311', '-94.523444738,39.0342411']

In [5]: results2 = [tuple(float(f) for f in  c.strip().split(',')[:2]) for c in coords_match]            
Out[5]: 
[(-94.5944738, 39.031411),
 (-94.59434738, 39.032311),
 (-94.523444738, 39.0342411)]

Edit: If you want to save the data as JSON, the float conversion is the better choice, because the result can be converted directly to JSON:

In [6]: import json

In [7]: print(json.dumps(results2, indent=2))                                                            
[
  [
    -94.5944738,
    39.031411
  ],
  [
    -94.59434738,
    39.032311
  ],
  [
    -94.523444738,
    39.0342411
  ]
]
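If the goal is a file on disk rather than printed text, json.dump can write the same structure directly. The sketch below assumes a list shaped like results2 above. The output path is hypothetical, and a temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile

results2 = [(-94.5944738, 39.031411), (-94.59434738, 39.032311)]

# Hypothetical output location inside a throwaway directory.
out_path = os.path.join(tempfile.mkdtemp(), "coords.json")

with open(out_path, "w") as fh:
    json.dump(results2, fh, indent=2)

with open(out_path) as fh:
    # JSON has no tuple type, so the pairs come back as lists;
    # convert them again if tuples are needed.
    loaded = [tuple(pair) for pair in json.load(fh)]

print(loaded)
```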

python

python-3.5