Why does a regex group sometimes treat a list of strings as a string and sometimes as a list?
I am parsing a PDF with regular expressions and extracting text.
Here is a sample of text_pos:
text_pos = [['5. qwe', 'LLL LLL 23', 'zzz qqq ewq (qwe ewq)', 'ewq \nqwe', 'eee wwww', 'qwewww'],
['LLL LLL 54', 'ttt qqq (eee www)', 'eeee\neee', 'aaaaa \nwww'],
['K K K K K K K K K K K K K 7 /', '111', 'zzz qqq qwe (ewq Lee)', 'qwee\neen', 'eewwww']]
Here is a snippet of my code:
text_pos = []
.
.
.
# REGEX
aaa = re.compile(r'(K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+\d.*)(zzz|ttt)', flags = re.DOTALL | re.MULTILINE)
bbb = re.compile(r'(LLL\s+LLL)(.*)(zzz|ttt)', flags = re.DOTALL | re.MULTILINE)
ccc = re.compile(r'(zzz|ttt\s+qqq)\s+(.*\))', flags = re.DOTALL | re.MULTILINE)
number = aaa.search(str(text_pos))
number1 = bbb.search(str(text_pos))
asker = ccc.search(str(text_pos))
try:
    if number:
        number.group(0)
except:
    pass
try:
    if number1:
        number = number1.group(2)
except:
    pass
try:
    if asker:
        asker.group(1)
except:
    pass
data.append([number, asker])
df1 = pd.DataFrame(data, columns =['text', 'number'])
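The regexes partly work, but sometimes text_pos seems to be treated as a string and sometimes not (I only get a re.Match object back instead of the actual characters).
Expected output: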
for v in df1['number']:
    print(v)
23
54
7 /111
for v in df1['asker']:
    print(v)
qqq ewq (qwe ewq)
qqq (eee www)
qqq qwe (ewq Lee)
Actual output:
for v in df1['number']:
    print(v)
23', 'zzz qqq ewq (qwe ewq)', 'ewq \nqwe', 'eee wwww', 'qwewww'
54', 'ttt qqq (eee www)', 'eeee\neee', 'aaaaa \nwww
<re.Match object; span=(2, 3470), match="K K K K K K K K K K K K K 7 /', '111', 'zzz >
for v in df1['asker']:
    print(v)
<re.Match object; span=(0, 59), match="['5. qwe', 'LLL LLL 23', 'zzz qqq>
<re.Match object; span=(24, 2203), match='ttt qqq (eee www)\', \'qwe>
<re.Match object; span=(47, 3015), match="zzz qqq qwe (ewq Lee)', 'q>
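One possible reading of the output above, offered as a sketch rather than a confirmed diagnosis: `str(text_pos)` turns the nested list into its printed form, quotes and commas included, which would explain the `', '` fragments in the matched text. Separately, `number.group(0)` and `asker.group(1)` are evaluated without being assigned, so those names keep pointing at the match objects. The `row` and `pattern` below are hypothetical, built from the sample data only for illustration.

```python
import re

# Hypothetical single row, copied from the text_pos sample.
row = ['K K K K K K K K K K K K K 7 /', '111',
       'zzz qqq qwe (ewq Lee)', 'qwee\neen', 'eewwww']

# str() of a list yields its repr: brackets, quotes and commas become text.
as_repr = str(row)
# Joining the items gives plain text without list punctuation.
as_text = ' '.join(row)

# Illustrative pattern (an assumption, not the original one).
pattern = re.compile(r'(zzz|ttt)\s+(qqq.*?\))', flags=re.DOTALL)
m = pattern.search(as_text)

unassigned = m
unassigned.group(2)    # return value is discarded; still a re.Match

assigned = m.group(2)  # keep the returned string

print(as_repr.startswith("['K K"))       # True
print(isinstance(unassigned, re.Match))  # True
print(assigned)                          # qqq qwe (ewq Lee)
```

Storing the return value of `.group(...)` in the variable that is later appended to `data` is what puts a string, rather than a match object, into the DataFrame.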
Edit, following Wiktor's suggestion: I tried applying the regex to each string in each list:
for i in text_pos:
    for j in i:
        m = re.search(aaa, j)
        if m:
            number = m.group(0)
This returns:
for v in df1['number']:
    print(v)
<re.Match object; span=(2, 3470), match="K K K K K K K K K K K K K 7 /', '111', 'zzz >
<re.Match object; span=(2, 3470), match="K K K K K K K K K K K K K 7 /', '111', 'zzz >
<re.Match object; span=(2, 3470), match="K K K K K K K K K K K K K 7 /', '111', 'zzz >
I can't explain why the approach below works, but it does:
text_list = ' '.join(map(str, text_pos))
aaa = re.compile(r'(K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K\s+K)(([^zzz|ttt]*).*)', flags = re.DOTALL | re.MULTILINE)
ccc = re.compile(r'(LLL\s+LLL)(([^zzz|ttt]*).*)', flags = re.DOTALL | re.MULTILINE)
number = aaa.search(text_list)
number1 = ccc.search(text_list)
if number:
    number = number.group(3)
else:
    number = number1.group(3)
data.append([text_list, number])
fake_file_handle.close()
df1 = pd.DataFrame(data, columns =['text_list', 'WP / number'])
for v in df1['number']:
    print(v)
23
54
7 / 1864
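A tentative note on why this version behaves differently: here the result of `.group(3)` is assigned back to `number`, and `[^zzz|ttt]*` is a negated character class, so it matches any run of characters other than `z`, `|` and `t` rather than "anything up to the word zzz or ttt". The sketch below compares it with a lazy group plus lookahead; the sample strings and both patterns are assumptions made up for illustration.

```python
import re

text = "LLL LLL 23 zzz qqq ewq (qwe ewq)"

# Negated character class: stops at the first 'z', '|' or 't'.
char_class = re.compile(r'LLL\s+LLL\s+([^zzz|ttt]*)')
# Lazy group with a lookahead: stops before the literal word zzz or ttt.
lookahead = re.compile(r'LLL\s+LLL\s+(.*?)\s*(?=zzz|ttt)', flags=re.DOTALL)

print(repr(char_class.search(text).group(1)))  # '23 '
print(repr(lookahead.search(text).group(1)))   # '23'

# With a lone 't' in the value, the character class cuts the match short.
other = "LLL LLL 2 to 3 zzz"
print(repr(char_class.search(other).group(1)))  # '2 '
print(repr(lookahead.search(other).group(1)))   # '2 to 3'
```

On the sample data the two give nearly the same result, which may be why the character class appears to work; they diverge as soon as the value itself contains a `z` or `t`.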