Extract plain text from SGML

I have a collection of 528k documents in SGML format. One of the documents looks like this:

<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT>    "jpuma009__l94008" </HT>


<HEADER>
<AU>   JPRS-UMA-94-009-L </AU>
JPRS 
Central Eurasia 

</HEADER>

<ABS>  Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 &amp; 2, </ABS>


<TEXT>
1993 
<DATE1>   17 June 1994 </DATE1>
<F P=100></F>
<F P=101>   Arms, Military Equipment </F>
<H3> <TI>   `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE>    `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
  Cooperation 

<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA, 
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>

22-28--FOR OFFICIAL USE ONLY 
<F P=103> 94UM0312D </F>
<F P=104>  Moscow VOORUZHENIYE, POLITIKA, 
KONVERSIYA </F>

<F P=105>  Russian </F>
CSO 

<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........ 

</TEXT>

</DOC>

I want to extract the plain text between <TEXT> and </TEXT>. The desired result looks like this:

1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
94UM0312D Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........

Is there a library or tool in Python or Java that can do this?

You can use BeautifulSoup in Python.

I tried the following code and got the desired output.

from bs4 import BeautifulSoup

# Read the whole SGML file into memory
with open('file.txt', 'r') as fo:
    sgml = fo.read()

# html.parser lowercases tag names, so <TEXT> is found as 'text'
soup = BeautifulSoup(sgml, 'html.parser')
for item in soup.find_all('text'):
    # Print every non-empty line with surrounding whitespace removed
    for line in item.text.split('\n'):
        if line.strip():
            print(line.strip())

Output

1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
Cooperation
94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........
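
The output above keeps text that wraps inside one element (for example the <F P=102> entry) on separate lines, while the desired result in the question joins it onto one line. As a rough sketch of one way to get that, the following uses only the standard library instead of BeautifulSoup: it splits the <TEXT> block on tags and collapses the whitespace inside each text run. The names text_chunks, TEXT_BLOCK and TAG are made up for this illustration, and the regular expressions assume tags as simple as the ones in the sample. Body text that spans several lines between two tags would also be joined into a single line.

```python
import html
import re

# Hypothetical helper names; the patterns assume the simple tag layout of the sample.
TEXT_BLOCK = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def text_chunks(sgml):
    """Return one whitespace-normalized string per text run inside <TEXT>."""
    chunks = []
    for block in TEXT_BLOCK.findall(sgml):
        # Splitting on tags leaves the text runs that sit between them
        for run in TAG.split(block):
            # Decode entities such as &amp;, then collapse newlines and
            # repeated spaces into single spaces
            cleaned = " ".join(html.unescape(run).split())
            if cleaned:
                chunks.append(cleaned)
    return chunks


# Shortened version of the sample document from the question
sample = """<DOC>
<TEXT>
1993
<DATE1>   17 June 1994 </DATE1>
<F P=100></F>
<F P=104>  Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>
CSO
</TEXT>
</DOC>"""

for line in text_chunks(sample):
    print(line)
```

On the shortened sample this prints four lines, with the wrapped <F P=104> entry joined into "Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA".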

file.txt

<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT>    "jpuma009__l94008" </HT>


<HEADER>
<AU>   JPRS-UMA-94-009-L </AU>
JPRS
Central Eurasia

</HEADER>

<ABS>  Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 &amp; 2, </ABS>


<TEXT>
1993
<DATE1>   17 June 1994 </DATE1>
<F P=100></F>
<F P=101>   Arms, Military Equipment </F>
<H3> <TI>   `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE>    `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
  Cooperation

<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>

22-28--FOR OFFICIAL USE ONLY
<F P=103> 94UM0312D </F>
<F P=104>  Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>

<F P=105>  Russian </F>
CSO

<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........

</TEXT>

</DOC>
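
Since the question mentions 528k documents, one file may well hold many <DOC> records. The sketch below shows one possible way to walk such a file record by record and pair each <DOCNO> with its text lines, using only the standard library. It assumes every record is wrapped in <DOC>...</DOC> as in file.txt. The function name iter_documents and the second record (EXAMPLE-0002) are invented for the illustration and are not part of the original data.

```python
import html
import re

# Assumed layout: each record is wrapped in <DOC>...</DOC> and has one <DOCNO>.
DOC = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def iter_documents(sgml):
    """Yield (docno, lines) for every <DOC> record found in the string."""
    for match in DOC.finditer(sgml):
        record = match.group(1)
        docno = DOCNO.search(record)
        body = TEXT.search(record)
        # Remove the tags, decode entities, then keep non-empty stripped lines
        plain = html.unescape(TAG.sub("", body.group(1))) if body else ""
        lines = [ln.strip() for ln in plain.splitlines() if ln.strip()]
        yield (docno.group(1).strip() if docno else None, lines)


# Two shortened records; the second one is made up for this example
sample = """<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<TEXT>
1993
<F P=105>  Russian </F>
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
CSO
</TEXT>
</DOC>"""

for docno, lines in iter_documents(sample):
    print(docno, lines)
```

Because iter_documents is a generator, the extracted lines can be written out one record at a time instead of being collected for the whole file first. The input string itself is still read into memory in this sketch.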