Why does a regex group sometimes treat a list of strings as a string and sometimes as a list?

I am parsing a PDF and extracting text from it with regular expressions.

Here is an example of text_pos:
text_pos = [['5. qwe', 'LLL LLL  23', 'zzz qqq ewq (qwe ewq)', 'ewq \nqwe', 'eee  wwww', 'qwewww'],
            ['LLL LLL  54', 'ttt qqq (eee www)', 'eeee\neee', 'aaaaa \nwww'],
            ['K K K K K K   K K K K K K K   7 /', '111', 'zzz qqq qwe (ewq Lee)', 'qwee\neen', 'eewwww']]

Here is a snippet of my code:

    text_pos = []
    .
    .
    .

    # REGEX
    aaa = re.compile(r'(K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+\d.*)(zzz|ttt)', flags = re.DOTALL | re.MULTILINE)
    bbb = re.compile(r'(LLL\s+LLL)(.*)(zzz|ttt)', flags = re.DOTALL | re.MULTILINE)
    ccc = re.compile(r'(zzz|ttt\s+qqq)\s+(.*\))', flags = re.DOTALL | re.MULTILINE)
    number = aaa.search(str(text_pos))
    number1 = bbb.search(str(text_pos))
    asker = ccc.search(str(text_pos))
    try:
        if number:
            number.group(0)
    except:
        pass
    try:
        if number1:
            number = number1.group(2)
    except:
        pass
    try:
        if asker:
            asker.group(1)
    except:
        pass
    
    data.append([number, asker])

df1 = pd.DataFrame(data, columns=['number', 'asker'])

The regex partly works, but sometimes it seems to treat text_pos as a string and sometimes not (it returns only the re.Match object instead of the actual characters).
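A minimal sketch of one likely cause, to make the symptom concrete. `str()` on a list always returns a string, so the regex never sees a list. A `re.Match` object shows up in the output whenever the return value of `.group()` is not assigned back to a variable. The names `row`, `pattern`, `found` and `unassigned` are hypothetical, and the pattern is a simplified stand-in for the ones in the snippet above.

```python
import re

# One row from the sample data; str() on a list always returns a str
# (brackets, quotes and commas included), never a list.
row = ['LLL LLL  54', 'ttt qqq (eee www)', 'eeee\neee', 'aaaaa \nwww']
text = str(row)

# Hypothetical simplified pattern, used only for this illustration.
pattern = re.compile(r'(LLL\s+LLL)\s+(\d+)')
found = pattern.search(text)

found.group(2)            # return value is discarded; `found` is unchanged
unassigned = found        # still a re.Match object

number = found.group(2)   # assigning the result gives the matched text
print(type(unassigned).__name__, number)
```

In the snippet above, only the `number1` branch assigns the result of `.group()`, which would explain why some rows print text and others print a Match object.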

Expected output:

for v in df1['number']:
    print(v)

23
54
7 /111

for v in df1['asker']:
    print(v)

qqq ewq (qwe ewq)
qqq (eee www)
qqq qwe (ewq Lee)

Actual output:

for v in df1['number']:
    print(v)

23', 'zzz qqq ewq (qwe ewq)', 'ewq \nqwe', 'eee  wwww', 'qwewww'
54', 'ttt qqq (eee www)', 'eeee\neee', 'aaaaa \nwww
<re.Match object; span=(2, 3470), match="K K K K K K   K K K K K K K   7 /', '111', 'zzz >

for v in df1['asker']:
    print(v)

<re.Match object; span=(0, 59), match="['5. qwe', 'LLL LLL  23', 'zzz qqq>
<re.Match object; span=(24, 2203), match='ttt qqq (eee www)\', \'qwe>
<re.Match object; span=(47, 3015), match="zzz qqq qwe (ewq Lee)', 'q>

Edit, following Wiktor's suggestion: I tried applying the regex to each string in each list

for i in text_pos:
    for j in i:
        m = re.search(aaa, j)
        if m:
            number = m.group(0)

returns

for v in df1['number']:
    print(v)

<re.Match object; span=(2, 3470), match="K K K K K K   K K K K K K K   7 /', '111', 'zzz >
<re.Match object; span=(2, 3470), match="K K K K K K   K K K K K K K   7 /', '111', 'zzz >
<re.Match object; span=(2, 3470), match="K K K K K K   K K K K K K K   7 /', '111', 'zzz >
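A sketch of a per-row variant of this idea, not necessarily what Wiktor had in mind. It assumes each inner list is one record and that its strings can be concatenated without a separator, which is what the expected `7 /111` suggests. The names `NUMBER`, `ASKER` and `extract` are hypothetical. The lazy `.*?` stops at the first `zzz` or `ttt`, whereas a greedy `.*` with `re.DOTALL` runs on to the last one.

```python
import re

text_pos = [['5. qwe', 'LLL LLL  23', 'zzz qqq ewq (qwe ewq)', 'ewq \nqwe', 'eee  wwww', 'qwewww'],
            ['LLL LLL  54', 'ttt qqq (eee www)', 'eeee\neee', 'aaaaa \nwww'],
            ['K K K K K K   K K K K K K K   7 /', '111', 'zzz qqq qwe (ewq Lee)', 'qwee\neen', 'eewwww']]

# Text between the 'LLL LLL' or thirteen-'K' marker and the first 'zzz'/'ttt'.
NUMBER = re.compile(r'(?:LLL\s+LLL|(?:K\s+){12}K)\s+(.*?)\s*(?:zzz|ttt)', re.DOTALL)
# Text after 'zzz'/'ttt' up to and including the first closing parenthesis.
ASKER = re.compile(r'(?:zzz|ttt)\s+(.*?\))', re.DOTALL)

def extract(row):
    """Return (number, asker) for one row, or None where nothing matches."""
    text = ''.join(row)  # assumption: no separator between the row's strings
    number = NUMBER.search(text)
    asker = ASKER.search(text)
    # Always store the result of .group(), never the Match object itself.
    return (number.group(1) if number else None,
            asker.group(1) if asker else None)

data = [extract(row) for row in text_pos]
for number, asker in data:
    print(number, '|', asker)
```

Searching row by row keeps a match from spilling into the next record, which is what the very large spans in the output above point to.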

I can't explain why the approach below works, but it does:

    text_list = ' '.join(map(str, text_pos))
  
    aaa = re.compile(r'(K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K)(([^zzz|ttt]*).*)', flags = re.DOTALL | re.MULTILINE)
    ccc = re.compile(r'(LLL\s+LLL)(([^zzz|ttt]*).*)', flags = re.DOTALL | re.MULTILINE)
    
    number = aaa.search(text_list)
    number1 = ccc.search(text_list)
    
    if number:
        number = number.group(3)
    else:
        number = number1.group(3)

data.append([text_list, number])
fake_file_handle.close()

df1 = pd.DataFrame(data, columns=['text_list', 'number'])


for v in df1['number']:
    print(v)

23
54
7 / 1864
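A short sketch of what `[^zzz|ttt]*` does, which may explain part of the behaviour. Inside square brackets the characters form a set, so this is a negated character class meaning "any character except `z`, `t` or `|`", not "anything up to the word `zzz` or `ttt`". The sample string and variable names are hypothetical.

```python
import re

# Inside [...] the characters are a set: [^zzz|ttt] is the same as [^zt|].
char_class = re.compile(r'[^zzz|ttt]*')

# Stops at the first single 't', not at the word 'ttt' or 'zzz'.
stops_early = char_class.match('fast zzz').group(0)

# A lazy group followed by a real alternation stops at the first 'zzz' or 'ttt'.
lazy = re.match(r'(.*?)(?:zzz|ttt)', 'fast zzz', flags=re.DOTALL).group(1)

print(repr(stops_early), repr(lazy))
```

So the version above probably works only because the text between the marker and `zzz`/`ttt` happens to contain no `z`, `t` or `|`.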