Regular expression in Python and re.split splitting the wrong thing

I have been trying to organize a text with Python, but re.split is not doing what I expect, even though my regular expression works (I tested it in Notepad++).

I need to split my text with a regular expression (and keep the matched text), but the text is being split character by character.
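For context, re.split keeps the matched text only when the pattern is wrapped in a capturing group. A minimal sketch on a made-up string (the sample text and variable names are illustrative, not taken from the actual file):

```python
import re

sample = "intro HEAD one HEAD two"  # hypothetical text, for illustration only

# Without a capturing group, the matched delimiter is dropped from the result.
without_group = re.split(r"HEAD", sample)

# With a capturing group, each match is kept as its own list item.
with_group = re.split(r"(HEAD)", sample)

print(without_group)
print(with_group)
```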

texttag holds the contents of a txt file that looks like this:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>

I am trying to split at this heading:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>

I want to split the text and tag it this way:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Here is all my code so far:

Dumas_XML=open("D:/cours/M1/S2/PTpython/GitHub/test2.txt","a") #C:/Users/super/Desktop/PTpython/GitHub/
# write the XML header
Dumas_XML.write('<?xml version="1.0" encoding="UTF-8"?>\n')
Dumas_XML.write('<Doc name="DUMAS" path="C:/Users/super/Desktop/PTpython/GitHub/textes"></Doc>\n')
Dumas_XML.write('<Document num="1" taille= "nombre de mots int()"/> </Document> \n ')

filetag = open("D:/cours/M1/S2/PTpython/GitHub/wordtag.txt")

import re
texttag= filetag.read()

regextag ="(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n<word>'.'</word><pos> 'PUNCT'</pos>\r\n<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"

xx=re.split(regextag, texttag)

compteurchap=0
for chap in xx :
    if re.search(regextag, chap) : 
        compteurchap=compteurchap+1
        Dumas_XML.write("<chap"+str(compteurchap)+">\n")
        print("<head>"+chap+"</head>")
        Dumas_XML.write("<head>"+chap+"</head>")
    #else:
        Dumas_XML.write(chap)
        Dumas_XML.write("</chap>\n")

How can I do this properly?

If you have to use regular expressions, this could be an option:

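import re

# texttag and Dumas_XML are the variables defined in the question's code.
# re.DOTALL lets "." span line breaks, so the lazy ".*?" stops at the first 'NOUN' tag.
pattern1 = re.compile(r"<word>.*?'NOUN'</pos>", re.MULTILINE | re.DOTALL)
# Capture everything after the first 'NOUN' tag, up to the end of the text.
pattern2 = re.compile(r"'NOUN'</pos>(.*)$", re.MULTILINE | re.DOTALL)

reobj = pattern1.search(texttag)

text = "<chap1>\n<head>"
# The captured remainder already starts with a line break, so none is added after </head>.
text += reobj.group() + "\n</head>"
text += pattern2.findall(texttag)[0]
text += "\n</chap>\n"
print(text)
Dumas_XML.write(text)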

Output:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Is this close to what you are looking for?
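If the file contains several chapter headings, a re.split-based variant may be closer to the original attempt. Some possible reasons the pattern in the question never matches:

- It hard-codes \r\n, while Python's default text mode converts line endings to \n when reading.
- [A-Z]{1,7} cannot match 'Ier', which contains lowercase letters.
- The unescaped . matches any character, not just a literal dot.

The sketch below adjusts for those points. The names HEAD_RE and wrap_chapters are made up for this example, and the assumption that every heading has the three-line CHAP / . / numeral shape comes only from the sample shown in the question.

```python
import re

# Heading pattern: three consecutive lines (CHAP, ".", numeral).
# \r?\n accepts both Windows and Unix line endings; \. matches a literal dot.
# Assumption: the numeral is 1 to 7 ASCII letters, as in 'Ier'.
HEAD_RE = re.compile(
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)


def wrap_chapters(texttag):
    """Wrap each heading and the text that follows it in <chapN> ... </chap>."""
    # The capturing group makes re.split keep the headings, so the list is
    # [text before first heading, heading 1, body 1, heading 2, body 2, ...].
    parts = HEAD_RE.split(texttag)
    out = [parts[0]]
    for number, i in enumerate(range(1, len(parts), 2), start=1):
        head, body = parts[i], parts[i + 1]
        # The body already starts with the line break that followed the heading.
        out.append(f"<chap{number}>\n<head>{head}\n</head>{body}")
        if not body.endswith("\n"):
            out.append("\n")
        # Closing tag kept as in the question, although <chap1> ... </chap>
        # is not well-formed XML.
        out.append("</chap>\n")
    return "".join(out)
```

With the variables from the question, this would be called as Dumas_XML.write(wrap_chapters(texttag)). Text that contains no heading is returned unchanged.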