Python XML: remove newlines inside tags

Question

The problem is that some XML files I scraped from the SEC contain newlines inside tags, so these files are not well-formed.

<footnote id="F4">Shares sold on the open market are reported as an average sell price per share of .87; breakdown of shares sold and per share sale prices are as follows; 100 at .31; 200 at .32; 100 at .33; 198 at .39; 600 at .40; 100 at .41; 102 at .42; 600 at .44; 320 at .45; 100 at .46; 900 at .47; 480 at .48; 300 at .49; 1,200 at .50; 400 at .51; 1,130 at .52; 600 at .53; 100 at .54; 1,500 at .55; 600 at .56; 644 at .57; 1,656 at .58; 1,070 at .59; 2069 at .60; 1,831 at .61; 1,000 at .62; 1,000 at .63; 492 at .64; 1,400 at .65; 920 at .66; 1,000 at .67; 600 at .68; 500 at .69; 1,200 at .70; 500 at .71; 582 at .72; 400 at .73; 1,108 at .74; 37 at .75; 710 at .76; 630 at .77; 1,600 at .78; 400 at .79; 400 at .80; 1,500 at .81; 1,100 at .82; 100 at .83; 800 at .84; 200 at .85; 1,300 at .87; additional shares sold continued on Footnote (5).</footnot
e>

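At first I thought it was an encoding mismatch between UTF-8 and ISO-8859-1, but the problem persisted after I changed the encoding. My next attempt was a regular expression that detects newlines inside tags, but since the breaks can appear anywhere, that approach is not very reliable.

Do you have any ideas for solving this?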

Answer 1

For this txt file with an XML part inside, you can do the following:

import re

# open the txt file
with open("0001112679-10-000086.txt", "r", encoding="utf8") as f:
    txt = f.read()

# cut out the xml part from the txt file
start = txt.find("<XML>")
end = txt.find("</XML>") + 6
xml = txt[start:end]

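# process the xml part: the file appears to be hard-wrapped after 1023 characters,
# so drop the newline after each full 1023-character line and keep the text (\1)
xml = re.sub(r"([^\n]{1023})\n", r"\1", xml)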

# combine a new txt back from the parts
new_txt = txt[:start] + xml + txt[end:]

# save the new txt in file
with open("0001112679-10-000086_output.txt", "w", encoding="utf8") as f:
    f.write(new_txt)
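
To try the idea without the full filing, here is a minimal sketch on a tiny sample. The helper names `join_wrapped_lines` and `is_well_formed` are hypothetical (not from the original answer), and a wrap width of 16 stands in for the 1023 assumed above. It joins the wrapped lines and then checks well-formedness with the standard library parser.

```python
import re
import xml.etree.ElementTree as ET


def join_wrapped_lines(text: str, width: int = 1023) -> str:
    """Drop each newline that directly follows `width` non-newline characters.

    Assumption: the file was hard-wrapped at `width` characters, so such a
    newline is an artifact of wrapping rather than part of the content.
    """
    pattern = re.compile(r"([^\n]{%d})\n" % width)
    return pattern.sub(r"\1", text)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Toy sample: the wrap after 16 characters falls inside the closing tag.
broken = "<r><note>hi</not\ne></r>"
fixed = join_wrapped_lines(broken, width=16)

print(is_well_formed(broken))  # parser rejects the split closing tag
print(fixed)
print(is_well_formed(fixed))
```

One caveat with this approach: a line that legitimately ends after exactly `width` characters would also be joined to the next one, so it is worth re-parsing the result as above before trusting it.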

python

regex

xml

well-formed