How to grab a value in a string?
How can I extract the STOP_DATE value from this long string in Python?
GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION
As others have shown, you can do this with a one-line regular expression, but this version is clearer:
import re
input_data=""" GROUP = TEMPORALINFORMATION\n\n OBJECT = PRODUCTIONDATETIME\n NUM_VAL = 1\n VALUE = "2015-07-19T18:29:43Z"\n END_OBJECT = PRODUCTIONDATETIME\n\n OBJECT = START_DATE\n NUM_VAL = 1\n VALUE = "2015-07-11T20:17:22Z"\n END_OBJECT = START_DATE\n\n OBJECT = STOP_DATE\n NUM_VAL = 1\n VALUE = "2015-07-11T21:03:52Z"\n END_OBJECT = STOP_DATE\n\n END_GROUP = TEMPORALINFORMATION
"""
def find_stop_date(s):
    in_stop_date = False  # True while we are inside the STOP_DATE object
    result = None
    for line in s.split("\n"):
        line = line.strip()
        # "^OBJECT" cannot match an END_OBJECT line because of the anchor
        if re.search(r"^OBJECT.*=.*STOP_DATE", line):
            in_stop_date = True
        if re.search(r"^END_OBJECT.*=.*STOP_DATE", line):
            in_stop_date = False
        if in_stop_date:
            # The captured text still includes the surrounding double quotes
            re_result = re.search(r"VALUE\s*=\s*(.*)", line)
            if re_result:
                result = re_result.group(1)
    return result

result = find_stop_date(input_data)
if result:
    print("Found: {}".format(result))
else:
    print("not found")
You can use this regular expression:
STOP_DATE.+?VALUE\s*=\s*\"(.+?)\"
Python code:
import re
regex = r"STOP_DATE.+?VALUE\s*=\s*\"(.+?)\""
match = re.search(regex, test_str, re.DOTALL)
print(match.group(1))
where test_str is the name of the variable holding the string.
Result:
2015-07-11T21:03:52Z
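Note that re.search returns None when nothing matches, so calling match.group(1) directly raises AttributeError on input that has no STOP_DATE block. A minimal sketch of a guarded version follows; the helper name extract_stop_date and the shortened sample string are assumptions made for illustration, not part of the answer above.

```python
import re

# Same pattern as in the answer above, compiled once so it can be reused.
STOP_DATE_RE = re.compile(r'STOP_DATE.+?VALUE\s*=\s*"(.+?)"', re.DOTALL)

def extract_stop_date(text):
    """Return the quoted STOP_DATE value, or None if it is absent."""
    match = STOP_DATE_RE.search(text)
    return match.group(1) if match else None

# Shortened sample containing only the STOP_DATE object.
sample = '''OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE'''

print(extract_stop_date(sample))           # 2015-07-11T21:03:52Z
print(extract_stop_date("no dates here"))  # None
```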
Sven's answer could be more refined: my pattern runs 5 times faster and does not need the DOTALL flag:
STOP_DATE[^"]+"([^"]+)
import re
test_str = '''GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION'''
print(re.search(r'STOP_DATE[^"]+"([^"]+)', test_str).group(1))
# 2015-07-11T21:03:52Z
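The performance gain comes from using two greedy negated character classes instead of the dot.
Because the desired substring is the only double-quoted value that follows STOP_DATE, the double quote is the only character that needs to be identified.
If your real data contains other double-quoted values and you need to require VALUE explicitly, you can use:
STOP_DATE[^"]+VALUE[^"]+"([^"]+)
but the number of steps required balloons to 2.5 times that of my previous pattern (though it is still 2 times faster than Sven's).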
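To check the comparison on your own machine, here is a rough sketch using timeit. The labels in the patterns dictionary and the repetition count are assumptions chosen for illustration. The figures quoted above are presumably regex-engine step counts, so wall-clock ratios measured this way may well differ.

```python
import re
import timeit

test_str = '''GROUP = TEMPORALINFORMATION
OBJECT = PRODUCTIONDATETIME
NUM_VAL = 1
VALUE = "2015-07-19T18:29:43Z"
END_OBJECT = PRODUCTIONDATETIME
OBJECT = START_DATE
NUM_VAL = 1
VALUE = "2015-07-11T20:17:22Z"
END_OBJECT = START_DATE
OBJECT = STOP_DATE
NUM_VAL = 1
VALUE = "2015-07-11T21:03:52Z"
END_OBJECT = STOP_DATE
END_GROUP = TEMPORALINFORMATION'''

# The three patterns discussed in this thread, precompiled.
patterns = {
    "lazy dot + DOTALL": re.compile(r'STOP_DATE.+?VALUE\s*=\s*"(.+?)"', re.DOTALL),
    "negated class": re.compile(r'STOP_DATE[^"]+"([^"]+)'),
    "negated class + VALUE": re.compile(r'STOP_DATE[^"]+VALUE[^"]+"([^"]+)'),
}

results = {}
for name, pattern in patterns.items():
    # Record what each pattern extracts, then time repeated searches.
    results[name] = pattern.search(test_str).group(1)
    seconds = timeit.timeit(lambda p=pattern: p.search(test_str), number=10_000)
    print(f"{name}: {results[name]} ({seconds:.4f}s for 10,000 searches)")
```

All three patterns extract the same value from this sample; only the timings differ, and those depend on the Python version and the input.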