Regex in Python returns nothing (search pattern keywords to use when searching with regex)

Question

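I'm not quite sure how regular expressions work, but I'm trying to build a project that analyses a mark scheme PDF and then does something useful with the data. I haven't set the project up yet; for now I'm working on the PDF-indexing side of the code with a test PDF.

The problem is that when I put my search pattern into the regex, nothing from the PDF is returned. In the code below I'm trying to use re.compile(r'\d{1} [A-D]') to iterate over every line that starts with a 1-2 digit number (the question column) followed by A-D (the answer column):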

import re
import requests
import pdfplumber
import pandas as pd


def download_file(url):
    local_filename = url.split('/')[-1]
    
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
        
    return local_filename



ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)

with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()


#print(text)

new_vend_re = re.compile(r'\d{1} [A-D]')

for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)

When I run the code, I get nothing back. Printing text does print the whole page, though.

This is the PDF I'm trying to use: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf

Answer 1

Your pattern matches exactly one space between the question number and the answer, but if you look at the output of text, there are multiple spaces between them.

'9700/12  Cambridge International AS/A Level – Mark Scheme  March 2019\nPUBLISHED \n \nQuestion  Answer  Marks \n1  A  1\n2  C  1\n3  C  1\n4  A  1\n5  A  1\n6  C  1\n7  A  1\n8  D  1\n9  A  1\n10  C  1\n11  B  1\n12  D  1\n13  B  1\n...
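To see the mismatch concretely, here is a minimal sketch that runs the pattern from the question against a few lines copied from the output above. The `sample` string and the variable names are made up for illustration; this is not the full page.

```python
import re

# A few lines copied from the extracted text above; note the two spaces
# between the columns.
sample = "Question  Answer  Marks \n1  A  1\n2  C  1\n10  C  1"

# Pattern from the question: one digit, exactly one space, then A-D.
single_space_re = re.compile(r'\d{1} [A-D]')

matches = [line for line in sample.split('\n') if single_space_re.match(line)]
print(matches)  # → []

# repr() makes the doubled spaces easy to spot when inspecting a line.
print(repr(sample.split('\n')[1]))  # → '1  A  1'
```

Printing a line with repr() is a quick way to check what whitespace the PDF extraction actually produced before writing the pattern.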

Change your regex to the following so that it matches one or more spaces instead of exactly one:

new_vend_re = re.compile(r'\d{1}\s+[A-D]')

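See alexpdev's answer for the difference between new_vend_re.match() and new_vend_re.search(). If you run this in your code, the matching lines are printed, and you can see in them that there are always two spaces rather than one.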
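The sketch below tries the `\s+` pattern on a few sample lines and compares `match()` with `search()`. The `\d{1,2}` variant is my own assumption, based on the question's mention of 1-2 digit question numbers; it is not part of the answer above. The list `sample_lines` and the variable names are likewise illustrative.

```python
import re

# Sample lines in the same shape as the extracted text (two spaces per gap).
sample_lines = ["Question  Answer  Marks ", "1  A  1", "9  A  1", "10  C  1", "13  B  1"]

one_digit_re = re.compile(r'\d{1}\s+[A-D]')    # pattern from this answer
two_digit_re = re.compile(r'\d{1,2}\s+[A-D]')  # assumption: allow 1-2 digits

# match() only succeeds if the pattern matches at the start of the line.
matched = [line for line in sample_lines if one_digit_re.match(line)]
# search() succeeds if the pattern matches anywhere in the line.
searched = [line for line in sample_lines if one_digit_re.search(line)]
# With 1-2 digits allowed, match() also accepts question numbers 10 and up.
matched_two = [line for line in sample_lines if two_digit_re.match(line)]

print(matched)      # → ['1  A  1', '9  A  1']
print(searched)     # → ['1  A  1', '9  A  1', '10  C  1', '13  B  1']
print(matched_two)  # → ['1  A  1', '9  A  1', '10  C  1', '13  B  1']
```

With `match()`, the one-digit pattern skips questions 10 and up because the character after the first digit is another digit, not whitespace. `search()` still finds those lines, but only by starting the match at the second digit.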

//Edit: fixed a typo in the regex

python

regex