How to ignore an unwanted pattern in a regex
I have the following Python code:
import re
from io import BytesIO

import pdfplumber
import requests

# Maps each report URL to the zero-based index of the page to search.
test_case = {
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0514/2020051400555.pdf': 59,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0529/2020052902118.pdf': 55,
    'https://www1.hkexnews.hk/listedco/listconews/sehk/2020/0618/2020061800366.pdf': 47,
    'https://www1.hkexnews.hk/listedco/listconews/gem/2020/0630/2020063002674.pdf': 30,
}

for url, page in test_case.items():
    rq = requests.get(url)
    # open() accepts a path or a file-like object; older releases exposed this as pdfplumber.load
    pdf = pdfplumber.open(BytesIO(rq.content))
    txt = pdf.pages[page].extract_text()
    txt = re.sub("([^\x00-\x7F])+", "", txt)  # strip non-ASCII (e.g. Chinese) characters
    pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
    try:
        auditor = re.search(pattern, txt, flags=re.MULTILINE).group('auditor').strip()
        print(repr(auditor))
    except AttributeError:
        # re.search returned None: dump the page text and URL for inspection
        print(txt)
        print('============')
        print(url)
It produces the following output:
'ShineWing'
'ShineWing'
'Hong Kong Standards on Auditing (HKSAs) issued by the Hong Kong Institute of'
'Hong Kong Financial Reporting Standards issued by the Hong Kong Institute of'
The desired output is:
'ShineWing'
'ShineWing'
'Ernst & Young'
'Elite Partners CPA Limited'
I have tried:
pattern = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$(?!Institute)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
This pattern captures the last two cases but not the first two.
pattern = r'.*\n.*?(?P<auditor>^(?!Hong|Kong)[A-Z].+?\n?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
This produces the desired result, but ^(?!Hong|Kong) is risky: it could exclude other desired results in the future, so it is not a good candidate.
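By contrast, $(?!Institute) seems more general and more appropriate, but I don't know why it fails to match the first two cases. It would be great if there were a way to ignore matches that contain issued by the Hong Kong Institute of.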
Any suggestions would be appreciated. Thank you.
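One likely reason the $(?!Institute) variant misses the first two cases is sketched below. The two sample strings are hypothetical stand-ins for the extracted page text, and the sketch assumes that in the ShineWing reports the firm name shares a line with "Certified Public Accountants". With re.MULTILINE, $ only matches immediately before a newline or at the end of the string, so the character after it can never start "Institute". The lookahead therefore never rejects anything, and the only effect of the change is that $ forces the auditor name to end at a line end.

```python
import re

TAIL = r'(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
ORIGINAL = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)' + TAIL
WITH_DOLLAR = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$' + TAIL
WITH_DOLLAR_LOOKAHEAD = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)$(?!Institute)' + TAIL

# Hypothetical page text (assumption): name on the same line vs. on its own line.
same_line = "Independent auditor\nShineWing Certified Public Accountants\nHong Kong"
next_line = "Independent auditor\nErnst & Young\nCertified Public Accountants\nHong Kong"


def find_auditor(pattern, text):
    """Return the stripped 'auditor' group, or None when nothing matches."""
    m = re.search(pattern, text, flags=re.MULTILINE)
    return m.group('auditor').strip() if m else None


print(find_auditor(ORIGINAL, same_line))               # ShineWing
print(find_auditor(WITH_DOLLAR_LOOKAHEAD, same_line))  # None: the name does not end at a line end
print(find_auditor(WITH_DOLLAR_LOOKAHEAD, next_line))  # Ernst & Young

# Dropping the lookahead changes nothing, which shows that $ alone causes the difference.
print(find_auditor(WITH_DOLLAR, same_line) == find_auditor(WITH_DOLLAR_LOOKAHEAD, same_line))  # True
```

If the real pages are laid out differently, the same helper can be pointed at the actual extracted text to see which variant matches where.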
pattern = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
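This works. The (?!.*Institute) lookahead sits at the start of the auditor group and rejects any candidate that is followed by Institute later on the same line, so the search skips the lines that mention the Hong Kong Institute of Certified Public Accountants.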
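To see the effect without downloading the PDFs, the pattern can be tried on made-up text. The sample strings below are hypothetical and only imitate the layout suggested by the outputs in the question; the helper name find_auditor is also just for illustration.

```python
import re

TAIL = r'(?:LLP\s*)?\s*((PRC.*?|Chinese.*?)?[Cc]ertified [Pp]ublic|[Cc]hartered) [Aa]ccountants'
ORIGINAL = r'.*\n.*?(?P<auditor>[A-Z].+?\n?)' + TAIL
# Pattern from the answer: the lookahead runs before the first letter of the name.
FIXED = r'\n.*?(?P<auditor>(?!.*Institute)[A-Z].+?)' + TAIL

# Hypothetical page text (assumption), with the institute mentioned before the auditor.
report = (
    "We conducted our audit in accordance with\n"
    "Hong Kong Standards on Auditing issued by the Hong Kong Institute of\n"
    "Certified Public Accountants.\n"
    "Ernst & Young\n"
    "Certified Public Accountants\n"
    "Hong Kong"
)
# Hypothetical text where the firm name shares a line with the designation.
same_line = "Independent auditor\nShineWing Certified Public Accountants\nHong Kong"


def find_auditor(pattern, text):
    """Return the stripped 'auditor' group, or None when nothing matches."""
    m = re.search(pattern, text, flags=re.MULTILINE)
    return m.group('auditor').strip() if m else None


print(find_auditor(ORIGINAL, report))   # Hong Kong Standards on Auditing issued by the Hong Kong Institute of
print(find_auditor(FIXED, report))      # Ernst & Young
print(find_auditor(FIXED, same_line))   # ShineWing
```

One caveat: because . does not cross newlines, the lookahead only inspects the rest of the current line. An auditor whose own line also contains the word Institute would be skipped as well, so the pattern is worth rechecking against new reports.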