Splitting a unique string - Python
I'm trying to find the best way to parse strings like this:
Operating Status: NOT AUTHORIZED Out of Service Date: None
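I need output like this:
['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
Is there a simple way to do this? I'm parsing hundreds of strings like this. There is no fixed text, but it always follows the format above.
Other example strings: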
MC/MX/FF Number(s): None DUNS Number: --
Power Units: 1 Drivers: 1
Expected output:
['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']
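There are two approaches. Both are quite clunky and very sensitive to small variations in the original string. You can, however, modify the code to make it more flexible.
Both options depend on the lines meeting these characteristics. Each grouping in question must:
- Start with a letter or a slash, possibly capitalized
- Have the header of interest followed by a colon (":")
- Have a one-word value, since only the first word after the colon is grabbed.
Method one, regex. This one can only grab two chunks of data. The second group is "everything else", because I couldn't get the search pattern to repeat properly :P
Code: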
import re

l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

pattern = ''.join([
    r"(",          # Start first capturing group
    r"\s*[A-Z/]",  # Optional whitespace, then the first capital letter or forward slash
    r".+?:",       # Any characters (non-greedy) up to and including the colon
    r"\s*",        # Zero or more whitespace characters
    r"\w+\s*",     # One or more word characters, i.e. [a-zA-Z0-9_], plus trailing whitespace
    r")",          # End first capturing group
    r"(.*)",       # Second capturing group: everything else
])

for s in l:
    m = re.search(pattern, s)
    print("----------------")
    if m:  # The pattern has exactly two groups
        print(m.group(1))
        print(m.group(2))
Output:
----------------
MC/MX/FF Number(s): None
DUNS Number: --
----------------
Power Units: 1
Drivers: 1
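If the goal is just to make the regex repeat, one possible sketch is to let re.findall do the repetition instead of adding more capturing groups. This keeps the same assumption as above (the value is the single whitespace-free word after the colon); the function name split_pairs is made up for illustration.

```python
import re

# Assumption: every value is exactly one whitespace-free token after the colon.
# \S        first non-space character of the header
# .*?:      the rest of the header, non-greedy, up to and including the colon
# \s*\S+    optional whitespace, then the one-word value
PAIR_RE = re.compile(r"\S.*?:\s*\S+")


def split_pairs(s):
    """Return every 'Header: value' chunk found in s (hypothetical helper)."""
    return PAIR_RE.findall(s)


for s in ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']:
    print(split_pairs(s))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
# ['Power Units: 1', 'Drivers: 1']
```

Using \S+ rather than \w+ for the value lets it pick up "--". It still cannot handle multi-word values: for "Operating Status: NOT AUTHORIZED Out of Service Date: None" the first chunk stops at "NOT" and "AUTHORIZED" is glued onto the next header.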
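Method two: parse the string word by word. This method relies on the same basic characteristics as the regex, but it can handle more than two chunks of interest. It works by:
- Parsing each string word by word and loading the words into newstring.
- Setting a flag when a word containing a colon is encountered.
- Adding the first word of the next loop iteration to newstring. If needed, you can change this to 1-2, 1-3, or 1-n words. You could also let it keep adding words after colonflag is set until some condition is met, such as a word starting with a capital letter, although that would break on values like "None". You could instead continue until a word is all capitals, but a header that is not all capitals would break that.
- Appending newstring to newlist, resetting the flag, and continuing to parse words.
Code: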
l = ['MC/MX/FF Number(s): None DUNS Number: -- ', 'Power Units: 1 Drivers: 1 ']

for s in l:
    newlist = []
    newstring = ""
    colonflag = False
    for w in s.split():
        newstring += " " + w
        if colonflag:
            # The word right after a header's colon completes the chunk
            newlist.append(newstring.strip())
            newstring = ""
            colonflag = False
        if ":" in w:
            colonflag = True
    print(newlist)
Output:
['MC/MX/FF Number(s): None', 'DUNS Number: --']
['Power Units: 1', 'Drivers: 1']
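The "1-n words" variation mentioned above could look roughly like the sketch below. The function name split_chunks and the value_words parameter are made up for illustration, and the approach only works when you know how many words each value has at most. A chunk that runs out of words at the end of the string is flushed as-is.

```python
def split_chunks(s, value_words=1):
    """Word-by-word split: a chunk ends value_words words after a colon word.

    Hypothetical helper; value_words is an assumed upper bound on value length.
    """
    chunks = []
    current = []
    remaining = None  # words still to collect for the current value
    for w in s.split():
        current.append(w)
        if remaining is not None:
            remaining -= 1
            if remaining == 0:
                chunks.append(" ".join(current))
                current, remaining = [], None
        elif ":" in w:
            remaining = value_words
    if current:  # trailing chunk whose value was shorter than value_words
        chunks.append(" ".join(current))
    return chunks


print(split_chunks('Power Units: 1 Drivers: 1 '))
# ['Power Units: 1', 'Drivers: 1']
print(split_chunks('Operating Status: NOT AUTHORIZED Out of Service Date: None', value_words=2))
# ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
```

The second call only works because the short value happens to be last. A one-word value in the middle of a string would swallow the first word of the next header when value_words=2.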
Third option:
Create a list of all expected headers, e.g. header_list = ["Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):", "DUNS Number:", "Power Units:", "Drivers:"], and build the split/parse around that list.
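A minimal sketch of that header-list idea, assuming the full set of headers really is known up front. The helper name split_on_headers is made up; the header list is the one from the paragraph above. Each chunk runs from one known header to the start of the next, so multi-word values such as "NOT AUTHORIZED" survive.

```python
import re

# Assumption: every header that can occur is listed here.
HEADER_LIST = [
    "Operating Status:", "Out of Service Date:", "MC/MX/FF Number(s):",
    "DUNS Number:", "Power Units:", "Drivers:",
]

# Escape the headers (they contain regex metacharacters such as parentheses)
# and try longer headers first so a short one cannot shadow a longer one.
HEADER_RE = re.compile(
    "|".join(re.escape(h) for h in sorted(HEADER_LIST, key=len, reverse=True))
)


def split_on_headers(s):
    """Split s into 'Header: value' chunks at each known header."""
    starts = [m.start() for m in HEADER_RE.finditer(s)]
    ends = starts[1:] + [len(s)]
    return [s[a:b].strip() for a, b in zip(starts, ends)]


print(split_on_headers("Operating Status: NOT AUTHORIZED Out of Service Date: None"))
# ['Operating Status: NOT AUTHORIZED', 'Out of Service Date: None']
print(split_on_headers("MC/MX/FF Number(s): None DUNS Number: -- "))
# ['MC/MX/FF Number(s): None', 'DUNS Number: --']
```

Any text before the first known header is dropped, and a header missing from the list ends up inside the previous chunk's value, so the list has to be kept complete.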
Fourth option:
Use Natural Language Processing and machine learning to actually work out where the logical sentences are ;)
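Take a look at pyparsing. It seems to be the most 'natural' way to express combinations of words, detect the (grammatical) relationships between them, and produce a structured response. There are plenty of tutorials and documentation online:
- Using the pyparsing module
- Getting Started with Pyparsing
- Pyparseltongue: Parsing Text with Pyparsing
You can install pyparsing with "pip install pyparsing".
To parse:
Operating Status: NOT AUTHORIZED Out of Service Date: None
you need something like this: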
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# test_pyparsing2.py
#
# Copyright 2019 John Coppens <john@jcoppens.com>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
# MA 02110-1301, USA.
#
#
import pyparsing as pp


def create_parser():
    opstatus = pp.Keyword("Operating Status:")
    auth = pp.Combine(pp.Optional(pp.Keyword("NOT"))) + pp.Keyword("AUTHORIZED")
    status = pp.Keyword("Out of Service Date:")
    date = pp.Keyword("None")
    part1 = pp.Group(opstatus + auth)
    part2 = pp.Group(status + date)
    return part1 + part2


def main(args):
    parser = create_parser()
    msg = "Operating Status: NOT AUTHORIZED Out of Service Date: None"
    print(parser.parseString(msg))
    msg = "Operating Status: AUTHORIZED Out of Service Date: None"
    print(parser.parseString(msg))
    return 0


if __name__ == '__main__':
    import sys
    sys.exit(main(sys.argv))
Running the program prints:
[['Operating Status:', 'NOT', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
[['Operating Status:', '', 'AUTHORIZED'], ['Out of Service Date:', 'None']]
With Combine and Group you can change how the output is organized.