Extract plain text from SGML
I have a list of 528k documents in SGML format. A sample document looks like this:
<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT> "jpuma009__l94008" </HT>
<HEADER>
<AU> JPRS-UMA-94-009-L </AU>
JPRS
Central Eurasia
</HEADER>
<ABS> Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 & 2, </ABS>
<TEXT>
1993
<DATE1> 17 June 1994 </DATE1>
<F P=100></F>
<F P=101> Arms, Military Equipment </F>
<H3> <TI> `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE> `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
Cooperation
<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>
22-28--FOR OFFICIAL USE ONLY
<F P=103> 94UM0312D </F>
<F P=104> Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>
<F P=105> Russian </F>
CSO
<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........
</TEXT>
</DOC>
I want to extract the plain text between <TEXT> and </TEXT>. The desired result is as follows:
1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
94UM0312D Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........
Is there a library or tool in Python or Java that can do this?
You can use BeautifulSoup in Python.
I tried this code and got the desired output.
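from bs4 import BeautifulSoup

# Read the whole SGML file into memory
with open('file.txt', 'r') as fo:
    sgml = fo.read()

# html.parser lowercases tag names, so <TEXT> is matched as 'text'
soup = BeautifulSoup(sgml, 'html.parser')
text_list = soup.find_all('text')
for item in text_list:
    # Print each non-empty line with surrounding whitespace stripped
    for line in item.text.split('\n'):
        if line.strip():
            print(line.strip())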
Output
1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
Cooperation
94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........
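The output above keeps the line breaks that occur inside an element, while the desired result in the question shows each element's text on a single line. A minimal sketch of one way to get that, assuming it is acceptable to collapse all whitespace inside each text node: iterate over `stripped_strings` and re-join each string on single spaces. The function name `extract_text_lines` and the shortened `sample` document are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_text_lines(sgml):
    """Return one whitespace-normalized line per text node inside <TEXT>."""
    soup = BeautifulSoup(sgml, "html.parser")
    lines = []
    for item in soup.find_all("text"):
        # stripped_strings yields every non-empty text node, already stripped
        for s in item.stripped_strings:
            # Collapse internal newlines and runs of spaces into single spaces
            lines.append(" ".join(s.split()))
    return lines


# Shortened, hypothetical sample in the same shape as the question's document
sample = """<DOC>
<TEXT>
1993
<F P=104> Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>
CSO
</TEXT>
</DOC>"""

for line in extract_text_lines(sample):
    print(line)
```

With this approach, free text that sits directly inside <TEXT> and spans several lines would also be joined into one line, which may or may not be wanted.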
file.txt
<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT> "jpuma009__l94008" </HT>
<HEADER>
<AU> JPRS-UMA-94-009-L </AU>
JPRS
Central Eurasia
</HEADER>
<ABS> Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 & 2, </ABS>
<TEXT>
1993
<DATE1> 17 June 1994 </DATE1>
<F P=100></F>
<F P=101> Arms, Military Equipment </F>
<H3> <TI> `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE> `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
Cooperation
<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>
22-28--FOR OFFICIAL USE ONLY
<F P=103> 94UM0312D </F>
<F P=104> Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>
<F P=105> Russian </F>
CSO
<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........
</TEXT>
</DOC>
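The file.txt above holds a single document, whereas the question mentions 528k of them. The thread does not say how they are stored, so the following is only a sketch under the assumption that many documents are concatenated in one file and that <DOC> and </DOC> each sit on their own line, as in the sample. It yields one document at a time so the whole corpus never has to be in memory; each yielded string can then be passed to BeautifulSoup as in the code above. The name `iter_docs` and the tiny in-memory corpus are hypothetical.

```python
import io


def iter_docs(lines):
    """Yield each <DOC>...</DOC> block from an iterable of lines."""
    buf = None  # None means we are currently outside a document
    for line in lines:
        tag = line.strip()
        if tag == "<DOC>":
            buf = [line]  # start collecting a new document
        elif buf is not None:
            buf.append(line)
            if tag == "</DOC>":
                yield "".join(buf)
                buf = None


# Hypothetical two-document corpus; a real run would pass an open file object
corpus = io.StringIO(
    "<DOC>\n<TEXT>\nfirst\n</TEXT>\n</DOC>\n"
    "<DOC>\n<TEXT>\nsecond\n</TEXT>\n</DOC>\n"
)
docs = list(iter_docs(corpus))
print(len(docs))
```

An open file works in place of `corpus`, since file objects also iterate line by line.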