OCR and extracting text that follows a specific substring - regex using Python
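I'm new to regex, so I'm sure I'm missing something obvious, but I need help with the following problem.
I want to extract the string that follows a specific substring. I'm working through a list of scanned documents and have the sample string below; I want to extract everything after "FORENAME".
This is what I have so far: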
import re

regex = r"(?<=(FORE))[A-Z]+"
# Raw string so the backslash in the OCR text is kept literally
test_str = r'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \ SURNAME VAN ROSSUM. '
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capture group of the current match (groups are numbered from 1)
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
This returns the following:
Match 1 was found at 78-82: NAME
Group 1 found at 74-78: FORE
What I want it to return is:
GUIDO \ SURNAME VAN ROSSUM.
Thanks!
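Based on the above, you can use:
import re
# Raw string so the backslash in the sample text is kept literally
test_str = r'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \ SURNAME VAN ROSSUM.'
# Replace the whole string with capture group 1, i.e. whatever follows FORENAME
result = re.sub(r"^.*FORENAME\s*(.*?)$", r"\1", test_str)
print(result)
# GUIDO \ SURNAME VAN ROSSUM.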
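For reference, the attempt in the question matched only NAME because the lookbehind (?<=(FORE)) anchors the match immediately after FORE, and [A-Z]+ then stops at the first character that is not an uppercase letter, which is the space after FORENAME. Below is a minimal sketch of a search-based alternative, assuming the marker is the literal text FORENAME and that everything up to the end of the line is wanted. The helper name text_after is hypothetical, not part of the re module.

```python
import re

def text_after(marker, text):
    """Return the text following the first occurrence of marker, or None."""
    # re.escape keeps any regex metacharacters in the marker literal.
    # \s* skips whitespace after the marker; (.*) captures the rest of the line.
    match = re.search(re.escape(marker) + r"\s*(.*)", text)
    return match.group(1).rstrip() if match else None

test_str = r'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \ SURNAME VAN ROSSUM. '
print(text_after("FORENAME", test_str))
# GUIDO \ SURNAME VAN ROSSUM.
```

Returning None when the marker is absent makes a missing field easy to detect when looping over many scanned documents.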
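A problem this simple doesn't need a regular expression:
# Raw string so the backslash in the sample text is kept literally
test_str = r'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \ SURNAME VAN ROSSUM. '
# Index of the first character after "FORENAME"
pos = test_str.find("FORENAME") + len("FORENAME")
print(test_str[pos:])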
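One caveat with find: it returns -1 when the marker is missing, so the slice above would silently start at index 7 instead of signalling a failure. Below is a minimal sketch using str.partition, which reports whether the separator was found. The function name after_marker and the second sample string are made up for illustration.

```python
def after_marker(text, marker="FORENAME"):
    """Return the stripped text after the first marker, or None if absent."""
    # partition returns (before, separator, after); separator is "" when not found.
    _, sep, tail = text.partition(marker)
    return tail.strip() if sep else None

documents = [
    r'ISSUED ON 2011-04-04 FORENAME GUIDO \ SURNAME VAN ROSSUM. ',
    'PAGE WITHOUT THE EXPECTED FIELD',
]
for doc in documents:
    print(after_marker(doc))
# GUIDO \ SURNAME VAN ROSSUM.
# None
```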